Performance Comparison of Machine Learning Algorithms for Estimating the Soil Salinity of Salt-Affected Soil Using Field Spectral Data

Salt-affected soil is a prominent ecological and environmental problem in dry farming areas throughout the world. China has nearly 9.9 million km² of salt-affected land. The identification, monitoring, and utilization of soil salinization have become important research topics for promoting sustainable progress. In this paper, using field-measured spectral data and soil salinity parameter data, and after analysis and transformation of the spectral data, five machine learning models are compared: random forest regression (RFR), support vector regression (SVR), gradient-boosted regression tree (GBRT), multilayer perceptron regression (MLPR), and least angle regression (Lars). Each model was evaluated on four aspects of estimating soil salinity: handling of collinearity, handling of data noise, stability, and accuracy. The results demonstrate that among the five models, RFR has the best performance in dealing with collinearity, RFR and MLPR have the best performance in dealing with data noise, and the SVR model is the most stable. The Lars model has the highest accuracy, with a determination coefficient (R²) of 0.87, ratio of performance to deviation (RPD) of 2.67, root mean square error (RMSE) of 0.18, and mean absolute percentage error (MAPE) of 0.11. A comprehensive comparison of the five models then shows that the RFR model has the best overall performance; hence, this method is the most suitable for estimating soil salinity using hyperspectral data. This study can provide a reference for the selection of regression methods in subsequent studies on estimating soil salinity using hyperspectral data.


Introduction
Salt-affected soil is a general term that refers to saline soil and alkaline soil. The content of soluble salt substances in saline soil typically exceeds 2 g/kg, which affects the normal development of crops. Alkaline soil is classified according to the alkalization degree and has a pH that exceeds 8 [1]. Typically, saline soil and alkaline soil are mixed; hence, they are collectively referred to as salt-affected soil. Salt-affected soil is a prominent ecological and environmental problem in the world's dry farming areas [2][3][4]. The total area of the salt-affected soil resources in China is approximately 9.9 million km², which are mainly distributed in the northeast plain, the arid and semi-arid areas in the northwest, the Huang-Huai-Hai plain, and the eastern coastal areas. Among them, the arid area in the northwest is China's largest salt-affected soil distribution area, with a total area of approximately 1.3 million km², which includes Qinghai, Xinjiang, western Inner Mongolia, Gansu Hexi Corridor, and northern Ningxia.
The second-largest is a coastal salt-affected soil region with an area of approximately 0.8 million km². This region is mainly distributed along the coasts of the Yellow Sea, the Bohai Sea, and the East China Sea [5]. The identification, monitoring, prevention, development, and utilization of soil salinization have become important research topics for social and economic development and for promoting sustainable progress. In recent years, with the development of technologies such as geographic information systems, remote sensing, and global positioning systems, a growing number of remote sensing technologies have been applied to the monitoring and measurement of salt-affected land [6]. Hyperspectral remote sensing plays an increasingly important role in the identification and monitoring of salt-affected land because of its richer spectral information.
There are three main approaches for studying soil salinization with hyperspectral data: One is to use hyperspectral image data to construct a distribution map of salt-affected land and to analyze the changes and driving factors of salt-affected land [7][8][9][10]. The second is to study the growth of crops under salt stress based on hyperspectral data and to infer the soil salt content from the crop growth [11,12]. The third is to invert the degree of soil salinization by using hyperspectral data to study the relationship between the field hyperspectral data and the measured parameters that are related to the soil salinity [13][14][15][16]. Here, we focus on the use of spectral data to estimate or invert the extent of soil salinization. The methods that are commonly used in this research include spectral decomposition methods and regression analysis methods [17][18][19][20][21]. Spectral decomposition methods are often used to estimate the degree of soil salinization using hyperspectral images. Typically, the pixels of hyperspectral images are mixed pixels. It is necessary to obtain pure endmembers via spectral demixing and other methods and to use the endmembers to estimate soil salinity [18,19]. Regression analysis methods are commonly used to estimate soil salinity based on near-end measured spectra [13,21,22].
The regression methods that are used for soil salinity estimation are mainly divided into traditional regression analysis methods and machine learning methods. Traditional regression analysis methods include least squares regression and partial least squares regression (PLSR) [21,23]. The least squares method identifies the best-matching function to the data by minimizing the sum of the squared errors. However, this method is highly sensitive to outliers, especially when the soil salt content is unevenly distributed and contains extreme maximum or minimum values [24,25]. The partial least squares regression method can overcome the multicollinearity problem of independent variables in regression analysis by combining principal component analysis and canonical correlation analysis on the basis of ordinary multiple regression. This method shows satisfactory applicability to high-dimensional hyperspectral data [26][27][28]. Peng et al. (2019) used the PLSR method to regress electrical conductivity (EC) for estimating soil salinity, and the results showed that the accuracy and stability of the model were good [29].
Machine learning methods use algorithms to parse data, to learn from data, and to make decisions and predictions about events in the real world. Unlike traditional methods for solving specified tasks, machine learning uses a large amount of data to "train" and learns how to accomplish tasks from the data via various algorithms [30,31]. Commonly used machine learning methods include decision trees, support vector machines, multilayer perceptrons, and regularization [30]. Tree-based methods and support vector machines show strong robustness when facing high-dimensional data. The hidden layer in the multilayer perceptron can reduce the dependence of the algorithm on the data. Regularization can reduce the influence of collinearity by adding a penalty term to the optimization function [30]. As research on machine learning methods has deepened, many studies in which these methods are used to estimate soil salinity have been reported, with uneven results [4,21,24,[32][33][34]. Jiang et al. monitored soil salinity by integrating multiple biophysical indicators with support vector machine (SVM) and artificial neural network (ANN) regression algorithms. The results demonstrate that the SVM regression algorithm outperforms the ANN algorithm in monitoring soil salinity [4]. Farifteh used the ANN algorithm and the PLSR algorithm to estimate the soil salinity. When field data were used for the estimation, the R² value obtained by the ANN algorithm ranged from 0.42 to 0.69 and the R² value obtained by the PLSR algorithm was 0.8. However, on the experimental data, the R² value obtained by the ANN algorithm exceeded 0.92 and the R² value obtained by PLSR was 0.81 [21]. Wang et al. used the PLSR algorithm and the random forest (RF) algorithm to measure soil salinity. According to the validation accuracies, the RF models outperformed the PLSR models [34].
These studies were conducted with different data and in different environments, so their results cannot be combined for a comprehensive evaluation of the models, nor can the accuracy of each model and the data to which it is suited be judged comprehensively. Therefore, it is necessary to analyze and compare the models from multiple perspectives using the same data. From the common machine learning families, five models were selected for comparison: random forest regression (RFR) and gradient-boosted regression tree (GBRT) from the tree-based methods, support vector regression (SVR) from support vector machines, multilayer perceptron regression (MLPR) from multilayer perceptrons, and least angle regression (Lars) from regularization methods. This paper compares these models from four perspectives: performance in dealing with collinearity, compared by inputting different numbers of bands; performance in dealing with data noise, compared by adding different degrees of Gaussian white noise; stability, compared by changing the amount of training data; and accuracy, compared using the leave-one-out cross-validation method. Then, the performance of each model in estimating soil salt content from hyperspectral data was evaluated comprehensively.

Study Area
The data collection area is Shizuishan City in the Ningxia Hui Autonomous Region in northwestern China. Shizuishan City is located between 105°58′ and 106°39′ east longitude and between 38°21′ and 39°25′ north latitude, in the upper part of the middle reaches of the Yellow River. Shizuishan City is approximately 88.8 km wide from east to west and 119.5 km long from north to south, at an altitude of 1090 to 3475.9 m (Figure 1) [35]. This area has a typical temperate continental climate, with an average temperature of 8.4 to 9.9 °C and an average precipitation of 167.5 to 188.8 mm. The soil in this area is mainly viscous soil. Because of the abundant sunshine throughout the year, concentrated rainfall, strong evaporation, and high groundwater level, coupled with irrigation with Yellow River water, a large amount of the cultivated land in this area exhibits various degrees of salinization [36].

Data Collection
From April 7 to April 10, 2018, we collected a total of 60 sets of data in Shizuishan. Each set of data was acquired via a five-point sampling method. First, we used a Spectra Vista Corporation (SVC) spectrometer to measure the spectral data without artificially disturbing the land. The related parameter information of the SVC spectrometer is listed in Table 1. The spectra were measured between 10 am and 2 pm. To reduce measurement error, a whiteboard was used for calibration prior to each measurement. During measurement, the probe pointed vertically downward, about 60 cm above the ground. Then, a surface soil sample was collected at the place where the spectrum was measured, placed in an aluminum cassette, returned to the laboratory for processing, and sent to the Ningxia Agricultural Technology Extension Station and the Analytical Testing Center of Beijing Normal University for measurement of the soil salt content and the main salt ion content. Finally, the latitude and longitude of each data collection point were obtained on site using a handheld GPS instrument [37]. Figure 2 presents the processed field spectra and photos of field sampling points.
Several obvious absorption zones can be seen in Figure 2, among which the water absorption zones are near 1400 nm and 1900 nm, and the clay mineral absorption zone is near 2200 nm [12]. Soil moisture is often a major interfering factor when soil properties are inverted from hyperspectral data. Numerous studies have shown that the absorption zones of moisture are around 1450 nm and 1940 nm [38][39][40]. Therefore, in the subsequent analysis, the two absorption bands of water are removed.

Soil Parameters
The pH, soil salinity, and sodium, potassium, magnesium, calcium, chloride, nitrate, sulfate, carbonate, and bicarbonate contents of the soil samples that were collected in the field were measured in the laboratory. The minimum, maximum, and average values (Table 2) of these quantities for the 60 samples were calculated, and a data distribution histogram and a normal probability plot (Figure 3a,b) were drawn. The pH value of the sampled soil was at least 7.53; hence, all samples were alkaline. The salt content ranged from 2919.76 to 290857.70 mg/kg. The contents of sodium, chloride, and sulfate ions were high; these are the main constituent ions of the soil salt. According to Figure 3a,b, the soil salt content in the collected samples is not normally distributed. In most samples, the soil salt content is low, and only a small part of the samples has a very high soil salt content. Such sample data may produce high R² values; however, the RMSE values could be proportionally high. To avoid this scenario, we transformed the soil salt content data by taking the base-10 logarithm so that the converted soil salt content data approximately follow a normal distribution (Figure 3c,d).
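The effect of the logarithmic transformation can be illustrated with a minimal sketch. The salt values below are synthetic and purely illustrative (only the minimum and maximum are taken from the text), and `skewness` is a hypothetical helper used here as a rough indicator of asymmetry; it is not a measure reported in the paper.

```python
import numpy as np

def skewness(values):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

# Synthetic, right-skewed salt contents in mg/kg (illustrative, not the measured data).
salt_mg_kg = np.array([2919.76, 3500.0, 4200.0, 5100.0, 6800.0,
                       9500.0, 15000.0, 42000.0, 120000.0, 290857.70])

# Base-10 logarithm, as described in the text.
log_salt = np.log10(salt_mg_kg)

print("skewness before transform:", skewness(salt_mg_kg))
print("skewness after transform: ", skewness(log_salt))
```

A few very saline samples dominate the raw scale; after the transformation they are compressed, so squared-error measures are less driven by those samples.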

Methods
The overall technical approach of this paper is divided into three main parts (Figure 4). The first part is data acquisition and preprocessing, which mainly involves the collection of field spectral data, spectral data de-noising, resampling and smoothing, and field soil sample collection, processing, and parameter measurement. The de-noising and resampling of spectral data are carried out by using the software (SVC HR-1024i, Version 1.17.14) provided with the measuring instrument. Without affecting the overall variation trend of the spectral curve, the noise of the data is removed and the data are interpolated into more than 2000 bands. The smoothing of spectral data is carried out by using the five-point method. The second part is the data analysis stage, in which the spectral data are transformed and the correlations between the various forms of spectral data and the soil parameter data are analyzed. The third part is the core part of this paper, namely, the comparative analysis of the models. First, according to the characteristics of the parameters of each model and the data characteristics, and after repeated testing, the optimal parameters of each model are determined. Then, the 60 groups of sample data are divided into training and test groups: 50 groups are used for training (modeling) and the remaining 10 groups are used for testing and comparison. Finally, the five models are comprehensively compared and analyzed from four aspects: performance in dealing with collinearity, performance in dealing with noise, stability, and accuracy.

Data Preprocessing Method
The spectral data that were measured in the field were subjected to a series of denoising and smoothing operations, which were conducted using the software that comes with the instrument. Then, since the sampling method uses a five-point sampling method, the five spectral curves that are measured at five points are averaged to obtain the field spectral data of each sampling point. Finally, the raw spectral (RS) data were transformed into four types, namely, first derivative (FD) [41], continuum removal (CR) [41], square root (SQT) [41], and standard normal variate (SNV) [42], and five types (RS, FD, CR, SQT, SNV) of spectral data were obtained for each sampling point.
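The averaging step and three of the four transformations can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing code used in the study: the wavelength grid and reflectance values are synthetic, the function names are hypothetical, SNV is computed per spectrum with the sample standard deviation, and continuum removal is omitted because it requires a convex-hull fit.

```python
import numpy as np

def first_derivative(reflectance, wavelengths):
    """FD: numerical derivative of reflectance with respect to wavelength."""
    return np.gradient(reflectance, wavelengths, axis=-1)

def square_root(reflectance):
    """SQT: element-wise square root of reflectance."""
    return np.sqrt(reflectance)

def standard_normal_variate(reflectance):
    """SNV: center each spectrum and scale it by its own standard deviation."""
    mean = reflectance.mean(axis=-1, keepdims=True)
    std = reflectance.std(axis=-1, ddof=1, keepdims=True)
    return (reflectance - mean) / std

# Hypothetical 1 nm grid and five synthetic curves for one sampling point.
wavelengths = np.arange(400.0, 411.0)
five_point_spectra = (0.2 + 0.001 * (wavelengths - 400.0)
                      + 0.01 * np.arange(5)[:, None])

rs = five_point_spectra.mean(axis=0)   # RS: average of the five curves
fd = first_derivative(rs, wavelengths)
sqt = square_root(rs)
snv = standard_normal_variate(rs)
```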

Random Forest Regression (RFR)
In the random forest method, a random approach is used to build a forest. The forest contains many decision trees, and there are no correlations among them. When a new sample is input, each decision tree in the forest evaluates it separately. In the regression problem, the random forest outputs the average of all decision tree outputs [43]. The random forest algorithm is a bagging algorithm, and bagging is an ensemble learning method whose main strategy is to train multiple weak models to form a strong model. The performance of the strong model is far superior to that of a single weak model [44]. In a random forest, each decision tree is "planted" and "grown" in four main steps:

1. Suppose the training set size is N. N samples are drawn via repeated sampling with replacement, and the sampling results are used as the training set for the decision tree;
2. If there are M input variables, each node randomly selects m (m < M) variables and uses these m variables to determine the best split point. During the generation of the decision tree, the value of m remains unchanged;
3. Each decision tree grows as much as possible without pruning;
4. New data are predicted by aggregating all decision trees (using majority voting in classification and averaging in regression) [43,45].
The RFR predictor is:

$$\hat{f}(x) = \frac{1}{k}\sum_{i=1}^{k} T_i(x)$$

where x represents an input vector that is composed of various evidential features, k represents the number of regression trees that are constructed in RFR, and T_i(x) represents the i-th constructed tree [46]. In this paper, RFR is implemented in Python based on the classification and regression tree (CART). CART is a model that is implemented internally by sklearn and spark [47]. Multiple binary decision trees are packaged into the RFR. When RFR is used for regression, the mean square error is minimized. The main parameters of the RFR call in Python are "n_estimators" and "max_features." "n_estimators" represents the number of trees in the forest; the larger the value, the higher the performance but also the longer the calculation time. "max_features" represents the size of the random feature subset that is considered when splitting nodes.
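The averaging behavior of the forest can be checked with a short sketch using sklearn's `RandomForestRegressor`. The data are synthetic and the parameter values (`n_estimators=50`, `max_features=3`) are illustrative assumptions, not the settings of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 60 samples x 10 bands (not the field spectra).
rng = np.random.default_rng(0)
X = rng.random((60, 10))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.05, 60)

# 50 samples for training and 10 for testing, mirroring the split in the text.
rfr = RandomForestRegressor(n_estimators=50, max_features=3, random_state=0)
rfr.fit(X[:50], y[:50])

# The forest output should equal the mean of the individual tree outputs.
forest_pred = rfr.predict(X[50:])
tree_mean = np.mean([tree.predict(X[50:]) for tree in rfr.estimators_], axis=0)

print(np.allclose(forest_pred, tree_mean))
```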

Support Vector Regression (SVR)
SVR is an application of the support vector machine (SVM) to the regression problem, namely, to find a regression plane that minimizes the distance of all the data of a set from the plane. For nonlinear models, the data must be projected into a feature space, and a linear model is then used in that feature space. To avoid huge computational complexity when mapping to the feature space, a kernel function is introduced when performing support vector machine regression. The kernel function corresponds to a transformation of the features from a low dimension to a high dimension, but it is evaluated in the low-dimensional space while the result applies in the high-dimensional space, which avoids the complex high-dimensional calculation and gives the same result [48]. In this article, SVR is implemented in Python. The common parameters when calling the function are kernel and C, where kernel indicates the type of kernel to be used in the algorithm and C indicates the penalty parameter for the error, namely, a regularization parameter that controls the trade-off between the complexity of the function and the tolerance of empirical error [47,49]. More time is required for training if C is larger; however, the prediction results stop improving after a threshold. The SVR function can be expressed as:

$$f(x) = \sum_{i=1}^{N} \left(\alpha_i - \alpha_i^{*}\right) k(x, x_i) + b$$

where N is the number of samples, α_i and α_i^* are the Lagrange multipliers, k(x, x_i) is the kernel function, and b is the bias term [50,51].
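The kernel expansion of the SVR function can be reproduced from a fitted sklearn `SVR` model, as a sketch under stated assumptions: the data are synthetic, the values of `C` and `gamma` are illustrative, and `dual_coef_` is taken, to our understanding of sklearn, to hold the differences of the Lagrange multipliers for the support vectors.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic stand-in data (not the field spectra).
rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.05, 60)

gamma = 0.5  # illustrative RBF kernel width
svr = SVR(kernel="rbf", C=10.0, gamma=gamma).fit(X[:50], y[:50])

# Kernel values between the test samples and the support vectors.
K = rbf_kernel(X[50:], svr.support_vectors_, gamma=gamma)

# f(x) = sum_i (alpha_i - alpha_i*) k(x, x_i) + b, summed over support vectors.
manual_pred = K @ svr.dual_coef_.ravel() + svr.intercept_[0]

print(np.allclose(manual_pred, svr.predict(X[50:])))
```

Only the support vectors contribute to the sum, because the multiplier difference is zero for every other training sample.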

Gradient-Boosted Regression Tree (GBRT)
The GBRT is an iterative decision tree algorithm that consists of multiple decision trees, and the conclusions of all the trees are summed to obtain the final answer. The GBRT was considered together with the SVM to be a highly generalized algorithm when it was proposed. The core strategy is that each tree learns the residuals of the conclusions of all previous trees. The residual is the amount that, when added to the current predicted value, yields the true value [52,53]. The typical parameters that GBRT uses in Python are "learning_rate," "n_estimators," "subsample," "max_depth," and "loss": "learning_rate" represents the learning rate, which scales the contribution of each tree; "n_estimators" represents the number of boosting stages to be executed; "subsample" represents the fraction of samples that is used for fitting each base learner; "max_depth" represents the maximum depth of the individual regression estimators, where the maximum depth limits the number of nodes in the tree; and "loss" represents the loss function to be optimized. It is necessary to balance "learning_rate" with "n_estimators" and "n_estimators" with "subsample" when the model is called [47]. The GBRT prediction can be expressed as:

$$\mathrm{predict} = \mathrm{predict}(0) + s\sum_{m=1}^{M} \mathrm{predict}(T_m)$$

where predict(0) is the initial predicted value, M is the number of trees, s is the scaling factor, and predict(T_m) is the predicted value of each tree [54].
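The residual-learning strategy can be sketched with a hand-written boosting loop. This is an illustration for squared-error loss, not the sklearn GBRT implementation used in the paper: sklearn's `DecisionTreeRegressor` serves as the base learner, the data are synthetic, and the values of `s`, `M`, and `max_depth` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the field spectra).
rng = np.random.default_rng(0)
X = rng.random((50, 5))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0.0, 0.02, 50)

s = 0.1   # scaling factor (learning rate), illustrative
M = 100   # number of trees, illustrative

# predict(0): start from the mean of the target.
prediction = np.full(y.shape, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(M):
    residual = y - prediction                      # what the previous trees missed
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += s * tree.predict(X)              # add the scaled tree output
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

A smaller `s` usually needs a larger `M` to reach the same training error, which is the balance between "learning_rate" and "n_estimators" mentioned in the text.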

Multilayer Perceptron Regression (MLPR)
Multilayer perceptron regression is a form of artificial neural network. The perceptron is a simple neuron model and a precursor to larger neural networks. A multilayer perceptron typically consists of an input layer, an output layer, and one or more hidden layers. Each layer is fully connected to the next layer, namely, each neuron in one layer is connected to all neurons in the next layer. The layers are connected by weights, and the neurons are arranged in the layers. In each layer, the nodes receive input only from the nodes in the previous layer and pass their output only to the nodes of the next layer [55,56]. Common parameters for calling MLPR in Python are "hidden_layer_sizes," which specifies the number of neurons in the hidden layer, and "activation," which indicates the function that is used in the hidden layer [47]. This model optimizes the squared loss via stochastic gradient descent. For a single hidden layer, the model can be expressed as:

$$f(x) = G\left(b^{(2)} + W^{(2)}\, s\left(b^{(1)} + W^{(1)} x\right)\right)$$

where b^(1) and W^(1) are the offset vector and the weight matrix, respectively, from the input layer to the hidden layer; s is the activation function of the hidden layer; b^(2) and W^(2) are the offset vector and the weight matrix, respectively, from the hidden layer to the output layer; and G is the activation function of the output layer [57].
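The forward pass of a perceptron with one hidden layer can be written out directly. The sketch below is illustrative only: the weights are made-up numbers rather than trained values, `tanh` is assumed for the hidden activation, and the output activation is assumed to be the identity, as is common for regression.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, s=np.tanh, G=lambda z: z):
    """One-hidden-layer forward pass: G(b2 + W2 @ s(b1 + W1 @ x))."""
    hidden = s(b1 + W1 @ x)     # input layer -> hidden layer
    return G(b2 + W2 @ hidden)  # hidden layer -> output layer

# Made-up weights: 2 inputs, 2 hidden neurons, 1 output.
W1 = np.array([[1.0, -1.0],
               [0.5, 0.5]])
b1 = np.zeros(2)
W2 = np.array([[2.0, 1.0]])
b2 = np.array([0.1])

out_zero = mlp_forward(np.array([0.0, 0.0]), W1, b1, W2, b2)
out_ones = mlp_forward(np.array([1.0, 1.0]), W1, b1, W2, b2)
print(out_zero, out_ones)
```

With a zero input the hidden activations vanish and only the output offset remains, which makes the role of each offset and weight matrix easy to trace.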

Least Angle Regression (Lars)
Least angle regression is a regression algorithm for high-dimensional data. Similar to forward stepwise regression, at each step, it identifies the feature that is most relevant to the target. When there are multiple features with equal correlation, instead of continuing along a single feature, it proceeds along an equiangular direction between the features [58]. The parameter that is typically specified when Lars is called in Python is "n_nonzero_coefs," which represents the target number of non-zero coefficients [47]. The coefficient update can be expressed as:

$$\beta_k \leftarrow \beta_k + \delta_k\,\mathrm{sign}\left(f_k^{T} r\right)$$

where r is the initial state; f_k is the feature that is most relevant to r; sign(f_k^T r) is the forward direction, namely, β_k is updated in the direction of sign(f_k^T r); and δ_k is the step size [59].
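The role of "n_nonzero_coefs" can be seen in a short sketch with sklearn's `Lars`. The data are synthetic with three informative features, and the value 3 is an illustrative choice rather than the setting used in this study.

```python
import numpy as np
from sklearn.linear_model import Lars

# Synthetic data: 200 samples, 20 features, only three of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = X @ true_coef + rng.normal(0.0, 0.01, 200)

# Stop the path once three features have entered the model.
lars = Lars(n_nonzero_coefs=3).fit(X, y)
selected = np.flatnonzero(lars.coef_)

print("features with non-zero coefficients:", selected)
```

Because features enter one at a time in order of their correlation with the residual, limiting the number of non-zero coefficients acts as a form of variable selection among collinear bands.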

Ordinary Least Squares (OLS)
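Ordinary least squares regression is a linear regression model that identifies the best-matching function for data by minimizing the sum of the squared errors. OLS is implemented in Python. For a single input variable, the general form of ordinary least squares is as follows:

$$\hat{y} = \hat{a} + \hat{b}\,x, \qquad \hat{b} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

where x_i is the input variable, y_i is the measured value, x̄ is the average of the input variables, ȳ is the average of the measured values, and b̂ and â are the estimated slope and intercept [24].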
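The closed-form least squares estimate for a single input variable can be sketched in a few lines. The function name `ols_fit` and the data are hypothetical, and the single-variable form is an assumption based on the symbol definitions in the text.

```python
import numpy as np

def ols_fit(x, y):
    """Slope and intercept that minimize the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data lying close to a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])

slope, intercept = ols_fit(x, y)
print(slope, intercept)
```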

Collinear Problems
When field hyperspectral data are used to evaluate the performance of each model in estimating soil salinity, the first evaluation criterion is the performance of each model in dealing with collinearity. Collinearity means that one variable can be represented linearly by several other variables [60]. The field hyperspectral data that we use contain more than 2000 bands, namely, more than 2000 features (variables), which easily causes collinearity among variables. If no variable selection has been conducted or if a suitable regression model has not been selected, false regression can easily occur. In this paper, to evaluate the performance of each model in dealing with collinearity, each model is fitted to the five forms of the spectrum using 5, 10, 15, 20, 30, 40, 50, and 100 bands. The bands are sorted by their correlation with the soil salt content from high to low before being input into the model, and the corresponding number of bands is input in that order. The performance of each model in dealing with collinearity is evaluated by comparing the regression results.
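The band-ranking step can be sketched as follows. The helper name `top_correlated_bands` and the data are hypothetical, and ranking by the absolute value of the Pearson correlation coefficient is an assumption; the text states only that bands are sorted by correlation from high to low.

```python
import numpy as np

def top_correlated_bands(spectra, target, n_bands):
    """Indices of the n_bands columns with the largest |Pearson r| to target."""
    xc = spectra - spectra.mean(axis=0)
    yc = target - target.mean()
    r = (xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (xc ** 2).sum(axis=0) * (yc ** 2).sum())
    order = np.argsort(-np.abs(r))   # strongest correlation first
    return order[:n_bands], r

# Synthetic spectra: 60 samples x 200 bands; band 17 drives the target.
rng = np.random.default_rng(0)
spectra = rng.random((60, 200))
salt = 3.0 * spectra[:, 17] + rng.normal(0.0, 0.05, 60)

subsets = {}
for n in (5, 10, 15, 20, 30, 40, 50, 100):
    idx, r = top_correlated_bands(spectra, salt, n)
    subsets[n] = spectra[:, idx]     # model input with n bands
```

Each entry of `subsets` can then be passed to a regression model so that the results for different numbers of bands are compared on the same samples.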

Data Noise Problems
When a spectrum is measured, many errors are generated, and the resulting noise appears in the spectral curve. Reflections, scattering, and light from other objects enter the probe together with the reflected light of the salt-affected soil, thereby affecting the spectral information of the soil. In addition, the sensor that receives the information introduces system noise into the spectral information. If the regression model is subsequently applied to satellite imagery, noise from atmospheric radiation will also be introduced [61]. A satisfactory model should perform regression analysis stably even when there is noise in the data. Gaussian white noise is an ideal model for analyzing additive channel noise: "Gaussian" indicates that the noise amplitude follows a normal distribution, and "white" indicates that the noise samples are uncorrelated (second-order moment) and that the mean (first-order moment) is constant. To evaluate the performance of each model in dealing with noise, Gaussian white noise is added to the original data in MATLAB at the following signal-to-noise ratios: 5, 10, 15, 20, 30, 40, and 50. The data with various degrees of noise are then used in the regression analysis, and the performance of each model in dealing with noise is evaluated by comparing the regression results.
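Adding Gaussian white noise at a prescribed signal-to-noise ratio can be sketched in Python with NumPy (the study itself used MATLAB). Two assumptions are made here: the signal-to-noise ratios are expressed in decibels, which the text does not state, and the noise power is set relative to the measured power of the signal. The function name and the spectrum are hypothetical.

```python
import numpy as np

def add_gaussian_white_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that the expected SNR equals snr_db (dB)."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
spectrum = 0.3 + 0.1 * np.sin(np.linspace(0.0, 6.0, 2000))  # synthetic curve

noisy = {snr: add_gaussian_white_noise(spectrum, snr, rng)
         for snr in (5, 10, 15, 20, 30, 40, 50)}

# Empirical check of the achieved SNR for one level.
achieved_20 = 10.0 * np.log10(np.mean(spectrum ** 2)
                              / np.mean((noisy[20] - spectrum) ** 2))
print(achieved_20)
```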

Stability
If the regression result of a model changes greatly in response to a change in the amount of input data, the constructed model is suitable only for limited scenarios and its regression results are unstable [62]. To evaluate the stability of each model, the input data, namely, the training data, are set to 3/4, 2/4, and 1/4 of the original data. The stability of each model is evaluated according to the magnitude of the change of each evaluation index as the amount of input data changes.

Leave-One-Out Cross-Validation Method
In leave-one-out cross-validation, if the size of the data set D is N, then N-1 data items are used for training and the remaining data item is used for verification. The main disadvantage of using a single data item for verification is that there may be a large difference between the true value and the predicted value. Therefore, a different item is removed from D as the verification set in each round until every sample has been evaluated once, which requires a total of N model fits, and the verification errors are averaged [63].
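The procedure can be sketched with an explicit loop. For brevity the model here is a linear least squares fit on synthetic data, which stands in for any of the regression models; the function name `loo_predictions` is hypothetical.

```python
import numpy as np

def loo_predictions(X, y):
    """Predict each sample from a model fitted on the other N-1 samples."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])   # add an intercept column
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i          # hold out sample i
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        preds[i] = A[i] @ coef
    return preds

# Synthetic stand-in data: 60 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.random((60, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.05, 60)

preds = loo_predictions(X, y)
loo_rmse = np.sqrt(np.mean((y - preds) ** 2))
print(loo_rmse)
```

With 60 samples this requires 60 fits, so every sample contributes one out-of-sample prediction to the averaged error.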

Evaluation Index
To evaluate the performance of the model in various aspects, we use four evaluation indicators: the determination coefficient (R²), the ratio of the performance to the deviation (RPD), the root mean square error (RMSE), and the mean absolute percentage error (MAPE). Their calculation formulas are as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}$$

$$\mathrm{RPD} = \frac{SD_s}{\mathrm{RMSE}}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^2}$$

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|$$

where N is the sample size, Y_i is the measured value of soil sample i, Ŷ_i is the salt content of soil sample i that is predicted by the models, Ȳ is the average measured salt content of the soil samples, SD_s is the standard deviation of the measured salt content, and RMSE is the root mean square error of the predicted salt content. Larger values of R² and RPD and smaller values of RMSE and MAPE correspond to higher model performance [21,22,64].
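The four indicators can be computed as in the sketch below. Several choices are assumptions rather than details confirmed by the text: R² is taken as one minus the ratio of the residual sum of squares to the total sum of squares, SD_s is the sample standard deviation (`ddof=1`), and MAPE is reported as a fraction rather than a percentage. The function name and the data are hypothetical.

```python
import numpy as np

def evaluation_indices(measured, predicted):
    """Return R2, RPD, RMSE, and MAPE for measured vs. predicted values."""
    y = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((y - p) ** 2))
    r2 = 1.0 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)
    rpd = np.std(y, ddof=1) / rmse          # SD of measured values over RMSE
    mape = np.mean(np.abs((y - p) / y))     # fraction, not percent
    return {"R2": r2, "RPD": rpd, "RMSE": rmse, "MAPE": mape}

# Made-up log-transformed salt contents and predictions.
measured = np.array([3.5, 3.8, 4.0, 4.4, 5.1])
predicted = np.array([3.6, 3.7, 4.1, 4.3, 5.0])

scores = evaluation_indices(measured, predicted)
print(scores)
```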

Spectrum Analysis
After the transformation of the raw spectral curve in four forms, four spectral curves were drawn for each spectral form, each curve representing different soil salt content ( Figure 5). When the FD spectrum curve is near 600 nm, the higher the soil salt content, the larger the value. The CR spectrum curve shows patterns that differ according to the salt content at 400 nm and 1900 nm. The higher the SQT curve at a fixed salt content, the larger the value of the curve. When the SNV curve is near 1500 nm and 2000 nm, different absorption characteristics and curve trends are observed as the salt content is varied. To evaluate the relationship between the band and the soil salinity, various forms of spectra were correlated with the soil salt content ( Figure 6). The results shown in Figure 6 are those that pass the significance test. The correlations between RS, SQT, and SNV and the soil salt content are strong in the wavelength range of 350-1400 nm, with correlation coefficients that exceed 0.6, and the correlation between the band in the range of 1400-1900 nm and soil salt content is from 0.4-0.6. The bands in which FD is highly correlated with the soil salinity are mainly concentrated from 1400 to 1800 nm and from 2000 to 2100 nm. The bands in which CR and FD are highly correlated with the soil salinity are similar and are mainly concentrated from 400 to 600 nm, from 1300 to 1800 nm, and from 1900 to 2100 nm. To evaluate the relationship among the bands with high correlations with the soil salt content, the curves of the various bands were subjected to autocorrelation analysis (Figure 7a-e). RS, SQT, and SNV have high autocorrelations in the ranges of 300-1900 nm and 2100-2300 nm and the correlation coefficients exceed 0.8. The correlation coefficients of FD between 350 and 500 nm, between 1100 and 1900 nm, and between 2000 and 2200 nm exceed 0.8. 
The correlation coefficients of CR between 350 and 600 nm, between 1000 and 1400 nm, and between 1400 and 2200 nm exceed 0.8. By comparison, there is also a strong correlation among the bands with high correlations with the soil salt content. Therefore, based on the correlation analysis results in this study, some characteristic wavelengths were selected for modeling while avoiding water absorption characteristics. The bands selected by RS are 455-554 nm, the bands selected by FD are 1372-1397 nm, 1640-1697 nm, 2090-2113 nm, the bands selected by CR are 1635-1734 nm, the bands selected by SQT are 464-563 nm, and the bands selected by SNV are 455-554 nm.

Model Parameter Determination
After considering the characteristics of each model and the characteristics of the data and after conducting repeated tests, the main parameters of each model are determined. The parameters are listed in Table 3.

Analysis on Collinear Problems
The results of the regression analysis of each model with the five spectral forms for various numbers of bands are presented in Figures 8 and 9. Figure 8 plots R² and RPD between the model-estimated values and the measured values. Figure 9 plots RMSE and MAPE between the estimated and measured results. The comparison shows that the RFR model performs the best on the collinearity problem: the regression result for FD changes abruptly when 20 bands are input, whereas the regression results of all other spectral forms do not change sharply as the number of bands increases. The RFR model is followed by Lars, for which slight fluctuations (in RS and FD) are observed between 5 and 30 input bands. When the number of bands exceeds 30, the regression performance is slightly reduced. Lars is followed by GBRT, which behaves similarly to Lars, except that more spectral forms exhibit fluctuations. In SVR, when the number of input RS bands is greater than 10, the regression result drops sharply. In MLPR, the regression results for RS, FD, and SQT fluctuate drastically with the number of bands. Therefore, the order of the models from strongest to weakest performance on collinearity is RFR > Lars > GBRT > SVR > MLPR.

Noise Processing Performance Analysis
The regression results in terms of R², RPD, RMSE, and MAPE of each model as the noise is gradually decreased, namely, as the signal-to-noise ratio is increased, are plotted in Figure 10. When the signal-to-noise ratio is less than 20, the fluctuation of the estimation results of GBRT is the largest; hence, the regression of this model is not highly robust to noise. GBRT is followed by SVR and Lars, which exhibit slight fluctuations as the signal-to-noise ratio changes. On noisy problems, the RFR and MLPR models are the most stable. Therefore, the order of the models from the strongest to weakest performance on noisy problems is MLPR > RFR > Lars > SVR > GBRT.
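One way to generate spectra at a prescribed signal-to-noise ratio is sketched below. The noise model is an assumption, because this section does not describe how the noise was produced: the sketch adds zero-mean Gaussian noise and interprets the signal-to-noise ratio in decibels, 10·log10(signal power / noise power). The function name `add_noise` is hypothetical.

```python
import math
import random

def add_noise(spectrum, snr_db, rng=None):
    """Return a copy of spectrum with zero-mean Gaussian noise at the requested SNR.

    Assumption: snr_db is in decibels, i.e. 10 * log10(signal_power / noise_power).
    """
    if rng is None:
        rng = random.Random()
    signal_power = sum(v * v for v in spectrum) / len(spectrum)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)  # standard deviation of the added noise
    return [v + rng.gauss(0.0, sigma) for v in spectrum]

# Synthetic flat spectrum, used only to illustrate the call.
rng = random.Random(0)
clean = [0.3] * 20000
noisy = add_noise(clean, snr_db=20, rng=rng)
```

Re-running each model on spectra produced with increasing `snr_db` and recording the indicators would give curves of the kind shown in Figure 10.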

Stability Analysis
The results of each model when the proportions of training data and test data are changed during modelling are presented in Figure 11. To quantify the changes in the estimation results of each model, the statistics of the changes in the indicators are presented in Table 4. When the training data are 3/4 of the total data, the smallest change occurs with the RFR model. When the test data are 2/4 and 1/4 of the total data, the smallest change occurs with the SVR model. According to the magnitude of the change, the SVR model is the most stable, followed by the MLPR, RFR, and Lars models. The GBRT model is the least stable. The order of the models from highest to lowest stability under three changes in the data volume is as follows: SVR > MLPR > RFR > Lars > GBRT.

Precision Analysis
The six models were evaluated using the leave-one-out cross-validation method. The results are presented in Figure 12. The Lars model yields the best prediction result. The obtained R² is 0.87, the RPD is 2.67, the RMSE is 0.18, and the MAPE is 0.11. The Lars model is followed by the RFR, SVR, and MLPR models. The obtained results are both greater than 0.8. GBRT yields the worst modelling results. The order of the models from highest to lowest precision is as follows: Lars > SVR > RFR > MLPR > GBRT.
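The leave-one-out procedure itself is independent of the regression model. The sketch below illustrates it with a single-variable least-squares line as a stand-in model; this stand-in, the function names, and the sample values are assumptions for illustration and are not the models or data used in the study.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one predictor (stand-in model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def leave_one_out(x, y):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]  # drop sample i from the training set
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        predictions.append(slope * x[i] + intercept)
    return predictions

# Synthetic band values and salt contents lying exactly on y = 2x + 1.
x = [0.1, 0.2, 0.4, 0.5, 0.9]
y = [2 * v + 1 for v in x]
predictions = leave_one_out(x, y)
```

The held-out predictions are then compared with the measured values to obtain R², RPD, RMSE, and MAPE, so every sample serves once as test data.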

Band Analysis
In this study, bands were selected in descending order of their correlation coefficients, based on the correlation analysis results, while the water absorption bands were removed. The selected bands are mainly concentrated in 455-563 nm, 1372-1397 nm, 1640-1734 nm, and 2090-2113 nm. The selected bands are compared with those of previous studies (Figure 13). Csillag et al. (1993) studied the spectral reflectivity of surface soil in the spectral range of 495-2395 nm, and found that the key bands for identifying salt-affected soil were mainly 550-770 nm, 900-1030 nm, 1270-1520 nm, 1940-2150 nm, 2150-2310 nm, and 2330-2400 nm [65]. These bands are similar to three of the bands used in this study: 455-563 nm, 1372-1397 nm, and 2090-2113 nm. Dehaan et al. (2002) pointed out in a study on the field-derived spectra of salinized soils that the spectral absorption features were located at 505, 920, 1415, 1915, and 2205 nm [12]. The absorption feature at 505 nm lies within the bands used in this study. Zhou et al. (2006) used laboratory hyperspectral data to estimate the physicochemical properties of reclaimed saline soils and selected six spectral bands, namely 448, 530, 670, 880, 1400, and 1900 nm, to discriminate the four saline land groups [66]. Among them, 448 nm, 530 nm, and 1400 nm are close to the bands used in this study. Weng et al. (2008) used reflectance spectroscopy to estimate the soil salt content, and pointed out that the reflectance at 1931-2123 nm and 2153-2254 nm was highly correlated with soil salt content [14]. The 2090-2113 nm band selected in this study through correlation analysis falls within the first of these ranges. Wang et al. (2012) modeled and inverted the effect of salinity on soil reflectance under various moisture conditions in the laboratory and obtained sensitive bands for several salt types: the sensitive bands of salt-affected soils were identified as 1920-2230 nm for Na2SO4, 1970-2450 nm for NaCl, and 350-400 nm for Na2CO3 [67]. Sidike et al. (2014) used image and spectral data to estimate the soil salinity, and the statistical analysis showed that the sensitive bands of soil salinity were 350-436 nm, 516-814 nm, 1445-1506 nm, 1667-1699 nm, 1882-2096 nm, and 2160-2393 nm [68]. Srivastava et al. (2017) used visible-near infrared reflectance spectroscopy to rapidly identify salt-affected soil and pointed out that the spectral range of 1390-2400 nm was highly sensitive to salinization [69]. The bands used in this study are similar to those in previous studies. Although the bands chosen in this study do not fully cover the salinity-sensitive bands reported in previous studies, the comparison shows that they can still represent the spectral features of soil salinity. The regression results also indicate that the bands used in this study can be used to estimate the soil salt content.

Figure 13. A comparison between the bands used in this study and those used in previous studies. The red represents the bands used in this study. The yellow represents the bands used in previous studies.
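The comparison with Csillag et al. (1993) can be checked with a simple interval-overlap test. The band limits below are the values quoted in the text; the function and variable names are hypothetical, and treating "similar" as "the wavelength ranges overlap" is an assumption made for this sketch.

```python
# Band limits in nm, as quoted in the text.
STUDY_BANDS = [(455, 563), (1372, 1397), (1640, 1734), (2090, 2113)]
CSILLAG_1993 = [(550, 770), (900, 1030), (1270, 1520),
                (1940, 2150), (2150, 2310), (2330, 2400)]

def overlaps(a, b):
    """True if the closed intervals a and b share at least one wavelength."""
    return a[0] <= b[1] and b[0] <= a[1]

def overlapping_bands(study, reference):
    """Study bands that overlap at least one reference band."""
    return [band for band in study if any(overlaps(band, ref) for ref in reference)]

print(overlapping_bands(STUDY_BANDS, CSILLAG_1993))
# [(455, 563), (1372, 1397), (2090, 2113)]
```

Under this criterion the 1640-1734 nm band has no counterpart among the bands of Csillag et al., whereas it does overlap the 1667-1699 nm band reported by Sidike et al. (2014).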

Comprehensive Comparison of Models
According to the comparative analysis, the RFR model performs the best on collinearity problems and it also realizes satisfactory performance in terms of noise processing, model stability, and model accuracy. The SVR model has the highest stability. Among the four aspects, GBRT performs best on collinearity, but it is poor in terms of data noise, model stability, and model accuracy. The MLPR model performs the best on noisy data. The OLS model performs the worst on collinearity problems and its accuracy is the lowest. The Lars model is the most accurate. To comprehensively consider the performance of each model, the performance in each aspect is assigned to one of six levels: the best performance is scored 6 and the worst 1. Then, the scores of each model are summed to comprehensively evaluate the performance of each model (Table 5). The overall performance of each model in terms of the four aspects is presented in Figure 14. The RFR model has the best comprehensive performance, followed by SVR, MLPR, and Lars, and GBRT and OLS perform the worst.
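The rank-sum scoring can be illustrated as follows. This sketch is not a reproduction of Table 5: it uses only the five-model orderings stated in the preceding subsections and leaves out OLS, whose position in the noise and stability orderings is not given in the text, so the levels run from 5 to 1 instead of 6 to 1. The names `RANKINGS` and `total_scores` are hypothetical.

```python
# Orderings from strongest to weakest, as stated in the four analyses above.
RANKINGS = {
    "collinearity": ["RFR", "Lars", "GBRT", "SVR", "MLPR"],
    "noise": ["MLPR", "RFR", "Lars", "SVR", "GBRT"],
    "stability": ["SVR", "MLPR", "RFR", "Lars", "GBRT"],
    "precision": ["Lars", "SVR", "RFR", "MLPR", "GBRT"],
}

def total_scores(rankings):
    """Give the best model in each aspect the top level, the worst level 1, then sum."""
    totals = {}
    for order in rankings.values():
        top = len(order)
        for position, model in enumerate(order):
            totals[model] = totals.get(model, 0) + (top - position)
    return totals

totals = total_scores(RANKINGS)
ranked = sorted(totals, key=totals.get, reverse=True)
print(totals)  # {'RFR': 15, 'Lars': 14, 'GBRT': 6, 'SVR': 13, 'MLPR': 12}
print(ranked)  # ['RFR', 'Lars', 'SVR', 'MLPR', 'GBRT']
```

With these five-model orderings RFR obtains the highest total; including OLS and the six-level scale of Table 5 may shift the totals of the other models.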

The Best Precision Model-Lars Model Depth Analysis
An in-depth analysis was conducted on the most accurate model, namely, Lars. First, the ability of the model to handle collinearity problems is explored. The number of bands input into the model is incremented from 1 to 100 in steps of 1. The analysis of the regression results shows that the results of the Lars model change little with the number of bands. This indicates that the model can process high-dimensional data and that the number of input bands has little influence on the results of the model (Figure 15). Then, we explore the limitations of the model in dealing with noisy problems. In the signal-to-noise ratio range of 1-50, the performance of the model on noisy problems is explored with a step size of 1. The results are presented in Figure 16. The analysis shows that when the signal-to-noise ratio is less than 23, the results of the Lars model fluctuate severely. When the signal-to-noise ratio is greater than 23, the results of the Lars model tend to be stable. Therefore, for the Lars model, when the signal-to-noise ratio of the input data is greater than 23, the results of the model are stable and acceptable.

Figure 16. Results of the Lars model on noisy problems. In the signal-to-noise ratio range of 1-50, the analysis is performed with a step size of 1.
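The text does not state how the threshold of 23 was determined. One hypothetical way to make such a threshold explicit is to look for the smallest signal-to-noise ratio from which every later step changes the indicator by no more than a chosen tolerance. The function name, the tolerance, and the score values below are assumptions for illustration only.

```python
def stable_from(snr_values, scores, tolerance):
    """Smallest SNR from which every later step changes the score by at most tolerance."""
    threshold = snr_values[-1]
    # Walk backwards from the highest SNR until a step exceeds the tolerance.
    for i in range(len(scores) - 1, 0, -1):
        if abs(scores[i] - scores[i - 1]) > tolerance:
            break
        threshold = snr_values[i - 1]
    return threshold

# Synthetic R2-like scores for SNR 1..10: erratic at first, then settling.
snr_values = list(range(1, 11))
scores = [0.2, 0.6, 0.1, 0.7, 0.5, 0.8, 0.81, 0.80, 0.82, 0.81]
print(stable_from(snr_values, scores, tolerance=0.05))  # 6
```

The result depends on the tolerance, so the criterion should be reported together with any threshold derived in this way.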

Comparison of Traditional Methods and Machine Learning Methods
The above results show that most machine learning methods perform well in estimating soil salinization. However, if a traditional regression method performs well, it is not necessary to choose a more complex machine learning method. Therefore, a comparison needs to be made between the traditional methods and the machine learning methods. The OLS and PLSR models were selected to represent the traditional methods, while the RFR model, which has the best comprehensive performance, and the Lars model, which has the highest accuracy, were selected to represent the machine learning methods.
The four models were compared in terms of four aspects (Figure 17). In dealing with the collinearity problem, the regression results of the OLS model decreased sharply as the number of bands increased, indicating that the model cannot cope with collinearity in the data. As the number of bands increases, the regression results of the PLSR model remain stable with slight fluctuations. In dealing with data noise, the regression results of the OLS model and the PLSR model vary greatly as the signal-to-noise ratio of the data changes, while the machine learning models are more stable, indicating that these two traditional methods are not as good at processing data noise as the machine learning methods. In terms of model stability, the four models showed similar stability. In terms of model accuracy, the machine learning models are clearly better than the traditional methods. Ma et al. (2019) also compared the accuracy of machine learning methods with that of traditional methods and reached the same conclusion, namely that the accuracy of the machine learning algorithms is better than that of the traditional methods [70]. Therefore, machine learning methods are more suitable for soil salinization research than traditional methods.

Spectral Difference Analysis Before and After Soil Disturbance
Field measurements of soil spectra were used in the study, and the soil was undisturbed at the time of the measurements. However, the soil was disturbed when it was chemically analyzed, so the salinity "seen" by the spectrometer may differ from that obtained by the chemical analysis. Thus, spectral differences between disturbed and undisturbed soils were investigated. Before the soil samples collected from the field were sent to the laboratory for analysis, we air-dried the samples, removed gravel and dead branches, passed the samples through a 1-mm sieve, and measured the laboratory spectra of each group of samples in a dark room. We compared the field spectra with the laboratory spectra (Figure 18). When the soil salt content is low, the trend of the field spectral curve is consistent with that of the laboratory spectral curve. Only at wavelengths greater than 800 nm is the reflectivity of the laboratory spectrum higher than that of the field spectrum. This is because the soil measured in the field contains moisture, which decreases the spectral reflectivity of the soil as a whole. As the soil salt content increases, the soil surface crust becomes more pronounced and the field spectral reflectivity gradually increases. At 1900 nm, there is a strong water absorption valley in the field spectrum. In this study, the selected wavelengths were mainly concentrated at 400-600 nm, within which the spectral difference before and after soil disturbance was small and had little influence on the research results. However, the spectra before and after the soil disturbance differ to a certain extent. This difference needs to be considered in subsequent studies, and its causes should be analyzed in depth.

Conclusions
This paper comprehensively analyses the performances of five machine learning models (RFR, SVR, GBRT, MLPR, and Lars) in estimating the soil salinity using field-measured spectral data. In dealing with the collinearity problem, the RFR model is the best and shows strong processing performance for high-dimensional data. The MLPR and RFR models perform well on data noise problems; thus, these two models can be considered if the data noise is high. The stability of the SVR model is satisfactory, and changes in the amount of data have less of an impact on this model. The Lars model has the highest accuracy in estimating the soil salinity. When applying the Lars model for optimal modelling results, the signal-to-noise ratio of the input data should exceed 23.
Through comprehensive analysis of each model, the ranking of the comprehensive performance of the models from strong to weak is RFR > Lars > SVR > MLPR > GBRT. The comparison shows that the comprehensive performance of the machine learning models is superior to that of the traditional regression methods. Therefore, the RFR model can be used as the preferred model for subsequent studies on hyperspectral estimation of soil salt content.