Combination of Hyperspectral and Machine Learning to Invert Soil Electrical Conductivity

Abstract: Accurate estimation of soil electrical conductivity (EC) using hyperspectral techniques is of great significance for understanding the spatial distribution of solutes and soil salinization. Although spectral transformation has been widely used in data pre-processing, the performance of different pre-processing techniques (or combinations of them) with different models on the same data set remains unclear. Moreover, extremely randomized trees (ERT) and light gradient boosting machine (LightGBM) are relatively new learning algorithms that have generalized well in other applications (soil moisture and above-ground biomass), but they have rarely been studied for estimating soil salinity from visible and near-infrared spectra. In this study, 130 soil EC measurements, measured soil hyperspectral data, topographic factors, conventional salinity indices such as Salinity Index 1, and two-band (2D) salinity indices such as ratio indices were used. Five spectral pre-processing methods were applied: standard normal variate (SNV), standard normal variate and detrend (SNV-DT), inverse (1/OR, where OR is the original spectrum), inverse-log (Log(1/OR)) and fractional order derivative (FOD) (orders 0–2, at intervals of 0.25). A gradient boosting machine (GBM) was used to select sensitive spectral parameters. Six models (extreme gradient boosting (XGBoost), LightGBM, random forest (RF), ERT, classification and regression tree (CART), and ridge regression (RR)) were used for soil EC inversion and model validation. The results reveal that the two-dimensional correlation coefficient highlighted EC more effectively than the one-dimensional one. Under SNV and the second-order derivative, the two-dimensional correlation coefficient was 0.286 and 0.258 higher, respectively, than the one-dimensional coefficient. Thirteen characteristic factors (slope, NDI, SI-T, RI, profile curvature, DOA, plane curvature, SI (conventional), elevation, Int2, aspect, S1 and TWI) provided 90% of the cumulative importance for EC using GBM. Among the six machine learning models, the ERT model performed the best for simulation (R² = 0.98) and validation (R² = 0.96). It also showed the best performance among the EC estimation models on the reference data. The kriging map based on the ERT simulation showed a close relationship with the measured data. Our study selected effective pre-processing methods (SNV and the second-order derivative) using one- and two-dimensional correlation, 13 important factors and the ERT model for hyperspectral inversion of EC. This provides theoretical support for the quantitative monitoring of soil salinization on a larger scale using remote sensing techniques.


Introduction
Salinization is a major soil degradation process that threatens ecosystems, agricultural production and the sustainability of the ecological environment in arid and semiarid regions worldwide [1–4]. Nearly 10 × 10⁸ hm² of land is affected by salinization worldwide, including more than 33 million hm² in China [5,6]. Salinization reduces crop yields and agricultural productivity, seriously threatening ecosystem health and economic sustainability. Therefore, timely and accurate monitoring of soil salt content is important to combat soil salinity and improve agricultural productivity in the face of climate change and human activities.
In agricultural production, soil salinization monitoring plays a very important guiding role in crop management [7,8]. Soil electrical conductivity (EC) is widely used in the study of saline soil [9], which can directly reflect soil salinity [10]. EC is an important indicator in measuring the degree of soil salinization and evaluating crop yield at regional scales [11,12].
Hyperspectral remote sensing techniques can fully exploit spectral information to achieve real-time, non-contact monitoring of soil salinization, and have become a major method for monitoring soil salinization [13,14]. Because the causes of soil salinization are complex, its spectral characteristics are significantly affected by soil texture, organic matter content, soil moisture, soil salt content and other factors. The fine spectral resolution of hyperspectral data captures the detailed spectral characteristics of ground objects and largely suppresses the influence of other interference factors. Several scholars have modeled the relationship between measured soil EC and spectral reflectance, and have successfully predicted salt content in soils using reflectance spectroscopy [15,16]. Pre-processing of hyperspectral data is key to improving inversion accuracy. Common pre-processing methods include spectral transformation and Savitzky-Golay filtering, among others [17,18]. However, no single pre-processing technique (or combination of techniques) suits all data sets. Moreover, these studies mainly considered the sensitivity of individual bands without examining the interaction between spectral bands in depth. The optimal band combination algorithm addresses this problem by calculating spectral indices (two-band (2D) salinity indices) from pairs of bands [19]. It enhances the relationship between soil attributes and spectral features, and minimizes the effects of irrelevant wavelengths [20,21]. The optimal band combination algorithm has therefore been widely used locally [10,22]. However, this method is not currently used in the study of hyperspectral inversion of soil salinization.
The causes of soil salinization and the composition of soil salinity are complex. The sensitive bands, salinity indices, vegetation indices, topographic factors and other environmental variables selected for remote sensing and monitoring of soil salinity differ between regions [23–26]. Although most of these variables can be obtained by band operations, they carry different degrees of information redundancy. Therefore, band selection strategies need to be developed [27,28], such as Pearson's correlation coefficient (PCC), gray relational analysis (GRA), and variable importance in projection (VIP). These feature filtering methods reduce information redundancy, but it is difficult to obtain the optimal subset of inversion parameters with them. Compared with the above variable optimization methods, gradient boosting machines (GBM) can efficiently construct and run boosted trees, perform parallel computation, and effectively process sparse data [29]. However, GBM is rarely used to optimize characteristic variables in soil salinization modeling.
Soil is a spatio-temporal continuum with high variability, and the non-linear effects of soil-forming factors on soil development lead to clearly varying properties over larger areas [20,30,31]. At present, remote sensing technology is the most effective means of soil surface monitoring, but insufficient data mining has become a bottleneck for efficient, high-precision monitoring. Research shows that machine learning (ML) inversion models have strong nonlinear fitting and data mining abilities, which increases the use of spectral reflectance information [32,33]. Back propagation neural networks (BPNN), support vector machines (SVM), multivariate adaptive regression splines (MARS), etc., have all been used to invert soil salt content [34–36]. However, the effectiveness of different modelling methods varies. Ensemble learning methods have the advantages of high flexibility and generalization. The more recently developed extremely randomized trees (ERT) [37] and light gradient boosting machine (LightGBM) [38] are simple and fast learning algorithms. They have shown good results for the adsorption energy of metal ions [39], soil moisture [40] and above-ground biomass [41], but are currently less well studied for estimating soil salinity from visible-near-infrared (Vis-NIR) spectra.
The Yinchuan Plain irrigation area is located in the upper reaches of the Yellow River, where the salinized soil area is about 2406 km² (where alkaline represents takyr solonetzs). Among salinized soils, the area with high salt content is mainly located in Pingluo County in the north, a typical salinized land in Yinchuan Plain, where light, moderate and heavily salinized soils account for 25.2%, 39.8% and 2.7% of the county area, respectively [42]. In addition, the EC prediction and inversion results obtained with previous pre-processing techniques and multivariate methods differ with region, soil type and spectral range. Few studies have simultaneously explored multiple forms of pre-processing and modeling methods on the same database. This study aims to provide a new approach for assessing soil salinization using hyperspectral analysis.
For this purpose, saline-alkali soil samples were collected and hyperspectral data were acquired in the study area in 2018, 2019 and 2021. Spectral pre-processing methods and machine learning methods were used for data simulation, best model selection and validation. The main objectives of our research were as follows: (1) explore the response of saline properties to sensitive spectral wavelengths; (2) compare the optimal spectral parameters under one- and two-dimensional correlation coefficients, and point out the influencing factors for the EC model; (3) acquire the best-performing model for predicting soil salinity, and map soil salinity in the study region; (4) apply and provide technical support for saline soil evaluation.

Study Area
The study area is located in Pingluo County (38°26′60″–39°14′09″ N, 105°57′40″–106°52′52″ E), northern Yinchuan Plain, Ningxia Province, China (Figure 1), covering an area of about 2060 km². The area is located in the irrigated middle and upper reaches of the Yellow River and lies between the diluvial fan and plain at the eastern foot of Helan Mountain. The study area experiences a warm temperate monsoon climate, with an annual mean temperature of 9 °C, low precipitation (annual mean: 150-203 mm), and strong evaporation (annual mean > 1825 mm). The research area is one of the areas most seriously affected by soil salinization in Ningxia Province, due to the low-lying terrain, poor drainage conditions, shallow groundwater depth, strong evaporation, water salinity pooling, and unreasonable irrigation. The major types of land use and land cover include water bodies, deserts, wastelands and basic farmland. The major soil types are lime calcite, saline, alkaline, and irrigated silt (Calcite Solonchak, Petrosalic Solonchak, Sodic Solonchak, Haplic Cambisol Salic according to the World Reference Base for Soil Resources (WRB)). The parent materials are mainly carbonate. The natural vegetation is dominated by salt tolerant vegetation (such as Nitraria tangutorum and Phragmites australis) [43].

Soil Sampling and Laboratory Analysis
Based on soil surface features, pH conditions, and land use patterns, nine sampling sites (57 samples) were selected throughout Pingluo County in northern Yinchuan Plain, Ningxia Province, in October 2018 (Figure 1). The sampling sites included basic farmland (non-alkaline or slightly saline-alkaline soil), medium- and low-yielding farmland (moderately to strongly saline-alkaline soil), and abandoned land (strongly saline-alkaline or alkaline soil) with varying levels of alkalinity. At each site, a soil auger was used to collect intact soil cores (0-20 cm in length) at intervals of 30, 60, 100, 200, and 300 m, after conducting the hyperspectral measurements. The sampling method was the same in March 2019 and May 2021. A total of 130 soil samples were collected, including 57 in 2018, 41 in 2019 and 32 in 2021. Sampling was carried out in a 5 km × 5 km grid of sample points (Figure 1). The collected soil samples (0-20 cm, non-mixed soil samples) were stored in sealed bags until use. The latitude and longitude of the sampling sites were recorded with a handheld global positioning system (GPS). Surface salinity accumulation, land use patterns, and vegetation types and cover were documented. After the samples were brought back to the laboratory, the soil moisture content (water content by weight) was determined using the drying method, and the soil EC was determined using the electrical conductivity method [44]. According to the definition of Brady and Weil [45], the soil in the study area was partitioned into five levels of salinity: non saline (0-0.4 dS m⁻¹), very slightly saline (0.4-0.8 dS m⁻¹), slightly saline (0.8-1.6 dS m⁻¹), moderately alkaline (1.6-2.4 dS m⁻¹) and strongly alkaline (>2.4 dS m⁻¹).

Hyperspectral Measurement and Data Processing
Field spectra were acquired at the time of soil sampling in each year. Soil spectroscopy was conducted at each sampling site using an SR-3500 spectrometer (Spectral Evolution, Esses, MA, USA), at wavelengths of 350 to 2500 nm. The spectral resolution was set at 3.5 nm from 350 to 1000 nm, 10 nm from 1000 to 1500 nm, and 7 nm from 1500 to 2100 nm. Measurements were carried out between 10:00 and 14:00 on a sunny day. During the hyperspectral measurements, the spectrometer was pointed vertically downwards with the probe at about 80 cm (waist height) above the surface. Before each measurement, the spectrometer was calibrated with the reference panel; five measurements per sampling site were then obtained and averaged to minimize instrument noise.
Remote Sens. 2022, 14, 2602

Spectral Reflectance Transformation and Selection of Spectral Indices
In order to eliminate the instrument noise and environmental background interference, the edge bands (350~399 nm and 2401~2500 nm) with excessive noise were removed. The remaining 400~2400 nm spectral data were resampled at 10 nm intervals, taking into account the smoothness and features of the spectral curves, to give original spectrum (OR) curves of 201 bands. Five types of spectrum pre-processing methods, namely standard normal variate (SNV), standard normal variate and detrend (SNV-DT), inverse (1/OR), inverse-log (Log(1/OR)) and fractional order derivative (FOD) (orders 0-2 at intervals of 0.25, where order 0 is the OR), were applied to the OR.
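The five transformations can be illustrated with a minimal NumPy sketch. Several choices below are assumptions that the text does not specify: a second-degree polynomial for the detrend step, base-10 logarithm for Log(1/OR), a Grünwald-Letnikov approximation for the FOD, and a unit step between bands. All function names are hypothetical, and the spectra are synthetic.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def snv_detrend(spectra, wavelengths, degree=2):
    """SNV followed by subtraction of a least-squares polynomial baseline.
    degree=2 is an assumption; the polynomial order is not given in the text."""
    z = snv(spectra)
    x = np.asarray(wavelengths, dtype=float)
    x = (x - x.mean()) / x.std()  # rescale wavelengths for numerical stability
    coeffs = np.polynomial.polynomial.polyfit(x, z.T, degree)  # (degree+1, n_samples)
    baseline = np.polynomial.polynomial.polyval(x, coeffs)     # (n_samples, n_bands)
    return z - baseline

def inverse(spectra):
    """1/OR."""
    return 1.0 / np.asarray(spectra, dtype=float)

def log_inverse(spectra):
    """Log(1/OR); base 10 is assumed here."""
    return np.log10(1.0 / np.asarray(spectra, dtype=float))

def fractional_derivative(spectra, order, step=1.0):
    """Grünwald-Letnikov fractional derivative along the band axis (assumed scheme).
    Weights follow w_0 = 1, w_k = w_{k-1} * (1 - (order + 1) / k)."""
    spectra = np.asarray(spectra, dtype=float)
    n_bands = spectra.shape[1]
    weights = np.ones(n_bands)
    for k in range(1, n_bands):
        weights[k] = weights[k - 1] * (1.0 - (order + 1.0) / k)
    out = np.zeros_like(spectra)
    for j in range(n_bands):
        # bands j, j-1, ..., 0 are paired with weights 0, 1, ..., j
        out[:, j] = spectra[:, j::-1] @ weights[: j + 1]
    return out / step ** order

# Synthetic example: 5 spectra resampled to 400-2400 nm at 10 nm (201 bands)
wavelengths = np.arange(400, 2401, 10)
rng = np.random.default_rng(0)
demo = 0.2 + 0.1 * rng.random((5, wavelengths.size))

orders = np.arange(0.0, 2.25, 0.25)  # 0, 0.25, ..., 2
transformed = {f"FOD-{a:g}": fractional_derivative(demo, a) for a in orders}
transformed.update({
    "SNV": snv(demo),
    "SNV-DT": snv_detrend(demo, wavelengths),
    "1/OR": inverse(demo),
    "Log(1/OR)": log_inverse(demo),
})
```

With orders 0 to 2 in steps of 0.25 (nine orders, order 0 being the OR) plus the four other transformations, the dictionary holds thirteen spectra sets. For integer orders the Grünwald-Letnikov weights reduce to ordinary first and second differences, which is a convenient sanity check.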
A spectral index is a linear or non-linear combination of reflectance in different bands. Spectral indices were used to establish the correlation between spectral data and specific targets, and to provide a scientific basis for soil salinity research [46]. This research mainly applies the following spectral characteristic indices: (1) deviation of arch (DOA); (2) conventional salinity indices (Table 1); and (3) two-band (2D) indices (Table 2).
Table 1. Reference overview of studies of spectral salinity indices and formula.
Table 2. Reference overview of studies of spectral indices and formula (entries include [56]). Note: Ri and Rj in the formula are the reflectance at any two wavelengths in 400-2400 nm, with Ri ≠ Rj.
All thirteen spectral transformations were involved in the calculation of the seven spectral indices mentioned above. For each spectral index, the wavelength combination with the largest correlation with soil EC was extracted and deemed to be the optimal band combination.
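The optimal band combination search can be sketched as follows for one two-band index, the normalized difference index NDI = (Ri − Rj)/(Ri + Rj). This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the data are synthetic, and Pearson correlation is assumed as the selection criterion.

```python
import numpy as np

def ndi_correlation_map(spectra, target):
    """Pearson r between the target and NDI(i, j) = (R_i - R_j) / (R_i + R_j)
    for every ordered band pair; entry [i, j] belongs to the pair (i, j)."""
    spectra = np.asarray(spectra, dtype=float)
    y = np.asarray(target, dtype=float)
    n_bands = spectra.shape[1]
    yc = y - y.mean()
    y_norm = np.sqrt((yc ** 2).sum())
    r_map = np.full((n_bands, n_bands), np.nan)
    for i in range(n_bands):
        with np.errstate(divide="ignore", invalid="ignore"):
            index = (spectra[:, [i]] - spectra) / (spectra[:, [i]] + spectra)
            centred = index - index.mean(axis=0)
            norm = np.sqrt((centred ** 2).sum(axis=0))
            r_map[i] = (centred * yc[:, None]).sum(axis=0) / (norm * y_norm)
    np.fill_diagonal(r_map, np.nan)  # i == j gives a constant index, so r is undefined
    return r_map

def best_band_pair(r_map):
    """Band pair with the maximum absolute correlation coefficient (MACC)."""
    flat = np.nanargmax(np.abs(r_map))
    i, j = np.unravel_index(flat, r_map.shape)
    return int(i), int(j), float(r_map[i, j])

# Synthetic check: the target is built from bands 3 and 7, so that pair should win
rng = np.random.default_rng(1)
demo_spectra = 0.2 + 0.3 * rng.random((40, 12))
demo_ec = (demo_spectra[:, 3] - demo_spectra[:, 7]) / (demo_spectra[:, 3] + demo_spectra[:, 7])
r_map = ndi_correlation_map(demo_spectra, demo_ec)
i_best, j_best, r_best = best_band_pair(r_map)
```

For 201 bands the map has 201 × 201 entries per index and transformation, which is the kind of matrix a two-dimensional correlation plot displays. Other two-band indices (ratio, difference, and so on) would only change the line that computes `index`.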

Topographical Factors
Topography is the main factor of soil formation and development in arid and semiarid regions, affecting the redistribution of surface material and energy. Digital Elevation Model (DEM) data were downloaded from the Geospatial Data Cloud website (http://www.gscloud.cn/, accessed on 24 May 2022) at a spatial resolution of 30 m. The elevation of each sampled point was extracted from the DEM using the Extract Multi Values to Points tool in Spatial Analyst Tools in ArcGIS 10.4, along with slope, aspect, plane curvature, profile curvature and topographic wetness index (TWI), as input variables to the model.

Feature Selection Based on Gradient Boosting Machine
Twenty-four variables (eleven conventional soil salinity indices, seven 2D indices, and six terrain parameters) were selected as feature descriptors. In consideration of the possible over-fitting risk, GBM was introduced to screen out the most important features from the 24 feature descriptors for participation in the subsequent construction of the soil EC model.
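A minimal sketch of this screening step is given below, assuming scikit-learn's `GradientBoostingRegressor` with default settings as the GBM and a 90% cumulative-importance cut-off (the threshold reported in the results). The helper name, the synthetic data and the feature names are hypothetical; the actual GBM implementation and settings used in the study are not stated here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def select_by_cumulative_importance(X, y, names, threshold=0.90, random_state=0):
    """Rank features by GBM importance and keep the smallest top-ranked set
    whose cumulative importance reaches the threshold."""
    model = GradientBoostingRegressor(random_state=random_state).fit(X, y)
    importance = model.feature_importances_          # sums to 1
    order = np.argsort(importance)[::-1]             # most important first
    cumulative = np.cumsum(importance[order])
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    n_keep = min(n_keep, len(names))
    return [(names[k], float(importance[k])) for k in order[:n_keep]]

# Synthetic example: 130 samples, 24 descriptors, only the first two drive the target
rng = np.random.default_rng(0)
X_demo = rng.random((130, 24))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + 0.05 * rng.standard_normal(130)
names = [f"f{k}" for k in range(24)]
selected = select_by_cumulative_importance(X_demo, y_demo, names)
```

Impurity-based importances such as these can be sensitive to correlated descriptors, so a permutation-based importance may be worth comparing when many indices are derived from the same bands.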

Modeling Strategies and Accuracy Assessment
To predict EC and to ensure the generalization and robustness of the models, the dataset was divided into disjoint training and validation sets using 5-fold cross validation [57]. XGBoost, LightGBM, RF, ERT, CART, and RR were used to build EC inversion models based on the factors selected by GBM. Using the Scikit-Learn toolkit (http://scikit-learn.org, accessed on 24 May 2022), the ML models were first trained on the training sets and then used to predict the EC of the validation set. The main parameters were tuned by grid search [58], and the other parameters were left at their Scikit-Learn default values. The optimal hyperparameters of the models are shown in Table 3. The determination coefficient (R²), mean squared error (MSE), correlation coefficient, standard deviation and root mean square error (RMSE) between the predicted and true values were calculated to evaluate the predictability. A scoring mechanism was developed to pre-evaluate the six ML models. The model with the largest R² and lowest MSE values was considered the most robust.
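The workflow above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's configuration: the hyperparameter grids are small placeholders (Table 3 values are not reproduced), the data are synthetic, and XGBoost and LightGBM are left out because they live in separate packages.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def evaluate_models(X, y, random_state=0):
    """Grid-search each model with 5-fold CV, then score out-of-fold predictions."""
    cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
    # Placeholder grids (assumed); other parameters stay at scikit-learn defaults
    candidates = {
        "ERT": (ExtraTreesRegressor(random_state=random_state),
                {"n_estimators": [100, 200], "max_depth": [None, 10]}),
        "RF": (RandomForestRegressor(random_state=random_state),
               {"n_estimators": [100, 200], "max_depth": [None, 10]}),
        "CART": (DecisionTreeRegressor(random_state=random_state),
                 {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}),
        "RR": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    }
    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="neg_mean_squared_error")
        search.fit(X, y)
        # Out-of-fold predictions from the tuned model on the same 5 folds
        pred = cross_val_predict(search.best_estimator_, X, y, cv=cv)
        mse = mean_squared_error(y, pred)
        results[name] = {
            "R2": float(r2_score(y, pred)),
            "MSE": float(mse),
            "RMSE": float(np.sqrt(mse)),
            "params": search.best_params_,
        }
    return results

# Synthetic example: 130 samples with 13 selected features
rng = np.random.default_rng(0)
X_demo = rng.random((130, 13))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] * X_demo[:, 2] + 0.05 * rng.standard_normal(130)
results = evaluate_models(X_demo, y_demo)
best_model = max(results, key=lambda name: results[name]["R2"])
```

Because the hyperparameters are chosen on the same folds that are later scored, the reported R² is somewhat optimistic; nested cross validation or a held-out test set gives a stricter estimate.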

Kriging
Kriging interpolation is a spatial local interpolation method [42], which makes use of the original data and the structure of the semi-variance function in order to get the unbiased best estimate of the unsampled regional variables. It mainly analyzes the structural and random characteristics of the regional variables, and then obtains their spatial distribution characteristics. The soil EC model with the highest inversion accuracy was selected. The kriging interpolation method was used to invert the spatial distribution of the soil EC. The inverted soil EC values were then compared with the interpolation results of 320 measured data by our research group, from 0-20 cm depths of soils from the whole Yinchuan Plain in 2019 and 2021, in order to verify the adaptability of the model on a large scale.
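For readers unfamiliar with the method, ordinary kriging can be sketched in a few lines of NumPy. The variogram model and its parameters below are assumptions chosen for illustration (an exponential model with fixed nugget, sill and range); the text does not state which semi-variance model was fitted, and in practice these parameters are estimated from the experimental variogram. Function names and data are hypothetical.

```python
import numpy as np

def exponential_variogram(h, nugget, sill, range_):
    """Exponential semi-variance model (assumed); gamma(0) is defined as 0."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / range_))
    return np.where(h > 0, gamma, 0.0)

def ordinary_kriging(xy, values, xy_new, nugget=0.0, sill=1.0, range_=1.0):
    """Ordinary kriging estimate and kriging variance at the points xy_new."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    xy_new = np.asarray(xy_new, dtype=float)
    n = len(xy)
    # Left-hand side: sample-to-sample semi-variances plus the unbiasedness constraint
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exponential_variogram(d, nugget, sill, range_)
    A[n, n] = 0.0
    # Right-hand side: sample-to-target semi-variances, one column per target
    d0 = np.linalg.norm(xy[:, None, :] - xy_new[None, :, :], axis=2)
    b = np.ones((n + 1, len(xy_new)))
    b[:n] = exponential_variogram(d0, nugget, sill, range_)
    solution = np.linalg.solve(A, b)        # kriging weights and Lagrange multiplier
    weights = solution[:n]
    estimate = weights.T @ values
    variance = (solution * b).sum(axis=0)   # sum_i w_i * gamma_i0 + mu
    return estimate, variance

# Synthetic example: 30 scattered points interpolated onto a coarse grid
rng = np.random.default_rng(0)
pts = rng.random((30, 2)) * 10.0
vals = np.sin(pts[:, 0]) + 0.5 * pts[:, 1]
gx, gy = np.meshgrid(np.linspace(0, 10, 5), np.linspace(0, 10, 5))
grid = np.column_stack([gx.ravel(), gy.ravel()])
grid_est, grid_var = ordinary_kriging(pts, vals, grid, nugget=0.0, sill=1.0, range_=5.0)
```

The constraint row forces the weights to sum to one, which is what makes the estimate unbiased. With a zero nugget the interpolator reproduces the sampled values exactly at the sample locations.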

The Spectral Characteristic of the Soil Samples
All hyperspectral characteristic curves of the soil samples were analyzed (Figure 2). Soil spectral reflectance increased with wavelength, with some fluctuation. The shape of the spectral curves was similar across salinity levels, with absorption valleys at 1400 nm and 1900 nm. Between 400 and 1400 nm, the spectral curves changed regularly with salinization: soil reflectance increased as salinization increased. Although this pattern is less obvious beyond 1400 nm, it can still separate different degrees of salinization, so soils with different salinization levels can be distinguished after appropriate processing.
The pattern after SNV transformation was similar to that of the OR reflectance curve, but the absorption valleys and reflection peaks of the curve were clearly enhanced (Figure 3). After SNV-DT transformation, the spectral absorption and reflection characteristics of the curve were enhanced, and several new features absent from the OR appeared, including two reflection peaks in the visible region (near 700 and 800 nm) and an absorption valley (near 2200 nm). After 1/OR and Log(1/OR) transformation, the curves showed a downward trend; compared with the OR, the absorption valley features were weakened, and the features at 1400 and 1900 nm were inverted relative to the OR. As the FOD order increased from 0 to 2, the intensity of the spectral signal weakened, but the spectral detail increased. The absorption valleys at 1400 nm and 1900 nm became increasingly obvious, and two small absorption valleys and a reflection peak were observed near 1400 nm. Meanwhile, the absorption valley at 1900 nm became gradually more pronounced as the overall absorption characteristics of the visible region increased (FOD 0-2). When the order increased from 1 to 2, most of the reflectance values approached zero, with one positive peak at 580 nm and two negative peaks at 480 nm and 660 nm. Moreover, the first-order derivative better identified both positive and negative peaks.

Correlation Analysis of Spectral Reflectance and EC
The correlation coefficients between EC and the reflectance processed by SNV, SNV-DT, 1/OR, Log(1/OR), and the first- and second-order derivatives in the range of 400~2400 nm were computed and plotted (Figure 4). In the OR, the correlation coefficient showed a steady downward trend with increasing wavelength. The SNV correlation was slightly higher than the OR correlation in the range of 400~700 nm, and then decreased and became negative from 1400 nm. The SNV-DT correlation increased in the 400~1100 nm range, crossed the SNV correlation near 1100 nm, and then gradually became negative. In 1/OR and Log(1/OR), EC was negatively correlated with the spectral reflectance over the whole wavelength range, with smooth changes and stronger correlation in the visible and near-infrared parts. The correlation coefficients for the first- and second-order derivatives alternated between positive and negative along the wavelength axis, which increased the correlation of several specific bands. The maximum absolute correlation coefficient (MACC) with EC was 0.596 at 420 nm for the second-order derivative reflectance, compared with only 0.396 at 400 nm for the OR.
In each spectral transformation, the band with the largest correlation coefficient was extracted (Table 4). The conventional salinity indices are calculated from optimum wavelengths between 450 and 1050 nm, i.e., blue: 455~492 nm; green: 492~577 nm; red: 622~770 nm; and near-infrared: 770~1050 nm, and the correlation coefficient was best in all four of these bands under SNV transformation (Figure 5a). Therefore, the corresponding best reflectance under SNV transformation was selected for salinity index calculation.
Figure 6 shows the band combinations with high correlation to EC for SNVDT-RI, SNV-NDI, 1/OR-RDVI, second-order derivative-DI, second-order derivative-NDI, second-order derivative-NPDI and second-order derivative-PI. The best bands of SNVDT-RI, SNV-NDI, and 1/OR-RDVI were more concentrated, while the best bands under the second-order derivative were more dispersed, mostly in the form of grids and dots. The MACC between SNV-NDI and EC was 0.69, and the highly correlated bands were mainly concentrated around 1200 nm and 1600 nm, with the explicit expression [(Ri − Rj)/(Ri + Rj)]. Overall, all 2D indices under the second-order derivative were subsequently selected to participate in the EC modeling, because the best spectral indices under the second-order derivative transformation were more numerous (Figure 5b).
Figure 6. Two-dimensional correlation coefficients between EC and optimal spectral index under different transformation reflectance and two derivative orders (The x and y axis represent the wavelength 400~2400 nm, the right-side color bar indicates the color of the PCC values. The colors dark red and dark blue represent a relatively high PCC (red for positive and blue for negative) between the measured EC and the band combinations).
In general, two-dimensional correlation coefficients show higher correlation values compared to one-dimensional. In the case of NDI at SNV, the best correlation coefficient between spectral reflectance and EC in one dimension was 0.488, which increased by 0.201 at a two-dimensional spectral index (Table 4).
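The one-dimensional side of this comparison amounts to correlating EC with each band separately and reporting the band with the maximum absolute correlation coefficient (MACC). A minimal sketch, with hypothetical function names and synthetic data, is:

```python
import numpy as np

def bandwise_correlation(spectra, target):
    """Pearson r between the target and the reflectance of every single band."""
    spectra = np.asarray(spectra, dtype=float)
    y = np.asarray(target, dtype=float)
    sc = spectra - spectra.mean(axis=0)
    yc = y - y.mean()
    return (sc * yc[:, None]).sum(axis=0) / np.sqrt((sc ** 2).sum(axis=0) * (yc ** 2).sum())

def macc(spectra, target, wavelengths):
    """Wavelength and signed r of the band with the largest |r|."""
    r = bandwise_correlation(spectra, target)
    k = int(np.argmax(np.abs(r)))
    return float(wavelengths[k]), float(r[k])

# Synthetic check: band 2 is a negated copy of the target, so it should be selected
rng = np.random.default_rng(2)
demo_ec = rng.random(50)
demo_spectra = rng.random((50, 6))
demo_spectra[:, 2] = -demo_ec
demo_wavelengths = np.array([400, 410, 420, 430, 440, 450])
best_wavelength, best_r = macc(demo_spectra, demo_ec, demo_wavelengths)
```

Using the figures quoted above for NDI under SNV, the one-dimensional value of 0.488 plus the reported gain of 0.201 gives 0.689, which agrees with the MACC of 0.69 reported for SNV-NDI.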

Feature Selection and Importance Analysis
Based on GBM results, the top 13-ranked features could provide more than 90% of the cumulative importance for the model (Figure 7) (the salinity index SI was denoted as SI (1), and the 2D index SI as SI (2)). Therefore, the top 13 feature variables were selected as the independent variables for the subsequent model.

Establishment and Verification of Soil EC Inversion Models
The spectral parameters and topographic factors selected by GBM and the EC contents were used as input and output datasets, respectively, to construct EC models via the XGBoost, LightGBM, RF, ERT, CART and RR methods. The prediction effect of the different models on the training set is shown in Figure 8. The models ranked as follows: ERT (0.98) > XGBoost (0.96) > RF (0.94) > CART (0.91) > LightGBM (0.80) > RR (0.51) (Figure 9a). Overall, the ERT model performed the best. ERT, XGBoost, RF and CART better predicted soil EC, with MSE values ranging from 0.37 to 2.02 (Figure 9b).

Testing of Predictive Models
The trained ERT model was applied to the July 2018 data (n = 42) to examine its prediction performance for soil EC inversion. All of the procedures for calculating salinity indices and modeling were the same as those stated previously. The model predictions were significantly correlated with the measured (true) data (R² = 0.96) (Figure 10a), and there was a strong linear relationship between the measured (true) and simulated values (y = ax + b, R² = 0.96) (Figure 10b). Therefore, the ERT model was the best model for EC simulation.
Figure 9. The normalized Taylor diagrams of the predicted and measured EC data (a) and the model accuracy comparison and mean squared error (MSE) of the six methods (b).

Digital Soil Maps of EC
The spatial distribution of soil EC was mapped by kriging interpolation (Figure 11). The ERT model inversion results were very close to the measured values. Overall, soil EC was highest in the north-east and midwest of the study area, and salinity decreased continuously from north to south. These results were consistent with the field investigation and demonstrated the validity of the model to some extent. To verify the application of the model on a large scale, 320 data points collected in 2019 and 2021 from Yinchuan Plain were used. Topographically, the areas with the lowest degree of salinization were mainly located in the mountainous areas in the southern part of the plain. The highest salinity in the region was found in the Xidatan area in the northwest. In addition, the EC values in Pingluo County in Yinchuan Plain were relatively consistent with the validation results of the model, which indicates that the model can be used for large-scale inversion.
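For readers unfamiliar with kriging, the sketch below shows one common variant, ordinary kriging with an exponential variogram, using NumPy. The kriging variant, the variogram model and its parameters are assumptions (the section names none of them), and the coordinates and EC values are hypothetical.

```python
import numpy as np


def exponential_variogram(h, sill=1.0, range_=10.0):
    """Assumed variogram model: gamma(h) = sill * (1 - exp(-h / range))."""
    return sill * (1.0 - np.exp(-h / range_))


def ordinary_kriging(coords, values, targets, variogram=exponential_variogram):
    """Ordinary kriging estimates at `targets` from samples at `coords`."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = len(coords)

    # Left-hand side: sample-to-sample variogram, bordered by the
    # unbiasedness constraint (weights sum to one).
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = variogram(d)
    lhs[n, n] = 0.0

    # Right-hand side: sample-to-target variogram, one column per target.
    d0 = np.linalg.norm(coords[:, None, :] - targets[None, :, :], axis=-1)
    rhs = np.ones((n + 1, len(targets)))
    rhs[:n, :] = variogram(d0)

    # Solve for the weights and drop the Lagrange multiplier row.
    weights = np.linalg.solve(lhs, rhs)[:n, :]
    return weights.T @ values


# Hypothetical sample locations (arbitrary units) and EC values.
sample_xy = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
sample_ec = [2.0, 4.0, 6.0, 8.0]
center_estimate = ordinary_kriging(sample_xy, sample_ec, [(5.0, 5.0)])
print(center_estimate)
```

Without a nugget term, ordinary kriging reproduces the sample values at the sample locations, which offers a simple sanity check on any implementation.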

Figure 11. Spatial distribution of soil EC (a) measured value, (b) ERT of study area, (c) sample point in Yinchuan Plain, (d) measured value in Yinchuan Plain.

Hyperspectral Pre-Processing
Spectral data pre-processing can effectively remove background noise, baseline effects and multiplicative interference, and is an important step in extracting useful spectral information and optimizing the quantitative performance of models [59]. Selecting a single pre-processing method that is most suitable for all datasets is difficult, if not infeasible [13]. In this study, five kinds of mathematical transformations (SNV, SNV-DT, 1/OR, Log (1/OR), and FOD (range 0-2, with intervals of 0.25)) were applied to the original spectra, which changed the spectral data to different degrees (Figure 3). After pre-processing, the reflection peaks and absorption valleys were better identified under the FOD transformation. Although spectral intensity gradually decreased, spectral detail increased: many tiny peaks appeared and grew as the derivative order increased, which refined the variation trend and reduced information loss [60].
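Two of the transformations named above can be sketched in a few lines. The SNV function uses the sample standard deviation, and the fractional order derivative uses the Grünwald-Letnikov definition; both choices are assumptions, since this section does not state which conventions were followed. The spectrum is hypothetical.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre and scale a single spectrum."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


def gl_weights(order, n_terms):
    """Grünwald-Letnikov coefficients: w0 = 1, wk = w(k-1) * (1 - (order + 1) / k)."""
    weights = [1.0]
    for k in range(1, n_terms):
        weights.append(weights[-1] * (1.0 - (order + 1.0) / k))
    return weights


def fractional_derivative(spectrum, order, step=1.0):
    """Grünwald-Letnikov derivative of the given order along the band axis."""
    n = len(spectrum)
    w = gl_weights(order, n)
    return [
        sum(w[k] * spectrum[i - k] for k in range(i + 1)) / step ** order
        for i in range(n)
    ]


# Hypothetical reflectance values at consecutive bands.
reflectance = [1.0, 2.0, 4.0, 7.0]
orders = [i * 0.25 for i in range(9)]  # 0, 0.25, ..., 2.0, as in the text
derivatives = {q: fractional_derivative(reflectance, q) for q in orders}
print(derivatives[1.0])
print(derivatives[2.0])
```

At integer orders the weights reduce to the familiar backward differences: order 1 gives R(i) - R(i-1), and order 2 gives R(i) - 2R(i-1) + R(i-2).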
According to the correlation analysis between the transformed reflectance and EC, the MACC of the two-dimensional correlation coefficient was much higher than that of the one-dimensional (Table 4). In one dimension, the MACCs of SNV and FOD were higher than that of OR. These results confirm that some spectral pre-processing methods eliminate or reduce unwanted side effects in reflectance spectra. The much larger two-dimensional correlation coefficient indicates that the two-band spectral index fully considers the interaction between spectral bands and effectively eliminates the overlapping absorption of soil components [21]. It further reveals the potential of the optimal band algorithm for determining sensitive spectral variables related to EC content. SNV had the best transformation effect in the Vis-NIR band, which may be because SNV improves the signal-to-noise ratio of the original absorbance spectrum and enhances the spectral absorption information related to component content [61]. The single band most strongly correlated with EC was found under the second-order derivative (PCC at 420 nm = 0.596), because the spectral derivative effectively deals with nonlinear problems, enhances the difference between similar spectra, and eliminates background noise. SNV and the second-order derivative showed the largest improvements from one dimension to two dimensions, 0.286 and 0.258, respectively. Peon et al. [62] successfully transferred laboratory spectral reflectance to different satellite sensors by means of their spectral response functions, or directly extracted sensitive spectral parameters from satellite sensors. Therefore, SNV and the second-order derivative can be used to calculate the optimal spectral index in order to simplify the input variables and develop more efficient and specific EC estimation models.
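The two-dimensional correlation analysis amounts to an exhaustive search over band pairs, which the sketch below illustrates for a normalized difference index. The NDI formula (Ri - Rj) / (Ri + Rj) is the conventional form and is assumed here; the wavelengths, spectra and EC values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ndi(r_i, r_j):
    """Normalized difference index of two band reflectances (assumed form)."""
    return (r_i - r_j) / (r_i + r_j)


def best_band_pair(spectra, ec, wavelengths):
    """Return (band_i, band_j, r) for the pair whose NDI has the largest |r| with EC."""
    best = None
    for i, j in combinations(range(len(wavelengths)), 2):
        index_values = [ndi(s[i], s[j]) for s in spectra]
        r = pearson(index_values, ec)
        if best is None or abs(r) > abs(best[2]):
            best = (wavelengths[i], wavelengths[j], r)
    return best


# Hypothetical data: four samples, three bands (nm), EC per sample.
wavelengths = [420, 660, 860]
spectra = [
    [0.10, 0.20, 0.30],
    [0.12, 0.20, 0.28],
    [0.15, 0.21, 0.25],
    [0.18, 0.20, 0.22],
]
ec = [1.0, 2.0, 3.5, 5.0]
band_i, band_j, r_best = best_band_pair(spectra, ec, wavelengths)
print(band_i, band_j, round(r_best, 3))
```

The same loop applies to ratio or difference indices by swapping the `ndi` function, and to transformed spectra by passing SNV or derivative spectra instead of raw reflectance.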

Feature and Model Selection
Studies show that the robustness of a model improves when potentially irrelevant environmental variables are removed [19]. In this study, the importance of variables was ranked using GBM, and DEM, slope, aspect, plane curvature, profile curvature and TWI were all found to participate in EC modeling. Topographic variables determine the direction of run-off, thus changing the mode and location of salt accumulation in soil [25]. Taghizadeh-Mehrjardi et al. [63] showed that DEM and its derivatives were significantly correlated with soil salinity and have great potential for monitoring soil salinity. Peng et al. [64] established an EC inversion model of the Aksu River Basin in Xinjiang using terrain attributes and Landsat 8 OLI indices, with R² reaching 0.92. It is therefore possible to apply these hyperspectral features and models to remote sensing.
Prediction accuracy is the most important factor for soil property inversion. Scholars have conducted relevant studies on the inversion of soil EC, but the models have produced mixed results [65][66][67][68][69]. In this study, the ERT and RF models performed better than XGBoost, LightGBM, CART, and RR (Figure 8). This may be because random attribute selection is introduced during the training of the RF model and because data are drawn on the basis of randomness and difference [70], which greatly improves the accuracy of decision making [33,71]. Studies have shown that ensemble learning and variable selection can improve the agreement between predicted and true values when variables are complex [72]. Compared with the bagging methods, the boosting methods predicted EC slightly less well, and of the two boosting methods, XGBoost had the better prediction effect. XGBoost largely avoids over-fitting by introducing regularization terms, thus improving the generalization ability of the model [39]. Therefore, XGBoost can be used as an effective method to build an estimation model of salt content in a given region. LightGBM is an efficient implementation of the gradient boosting decision tree (GBDT). As a distributed gradient boosting framework, it is mainly optimized for training speed and memory use. The leaf-wise strategy unique to LightGBM makes it easier to control model complexity; however, it is difficult to bring its advantages into full play on small data sets [38]. This may explain the poor performance of LightGBM in this study. It should be pointed out that the best single model cannot guarantee the highest accuracy under changing input data or future conditions [73].
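As an illustration of how an ERT model of this kind might be fitted and tuned, the sketch below runs a small grid search with scikit-learn. The library choice is an assumption (the section does not name the implementation), the data are synthetic stand-ins with the same shape as the study's inputs (130 samples, 13 features), and the grid values are hypothetical apart from including the reported optimum.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: NOT the study's measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(130, 13))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=130)

# Hypothetical grid over the two hyperparameters mentioned in the paper.
param_grid = {"n_estimators": [11, 21, 31], "max_depth": [6, 10, 14]}
search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# Training-set accuracy of the refitted best model.
train_pred = search.best_estimator_.predict(X)
train_r2 = r2_score(y, train_pred)
train_mse = mean_squared_error(y, train_pred)
print(search.best_params_, train_r2, train_mse)
```

Because each tree in an ERT ensemble is grown with randomized split thresholds, fixing `random_state` is needed for the search to be repeatable; accuracy on the training set alone tends to be optimistic, so an independent test set remains necessary.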

Spatial Distribution of Soil Salinity
Interpolation of both the model-predicted and the measured values shows that soil EC is higher in the midwest and northeast of Pingluo County. Salinity decreases continuously from north to south, and most low-salinity soils are distributed in the south. In general, owing to the regional landform, groundwater, drainage and other related factors, salinization was more serious around Sha Lake, West Lake and Mingshui Lake in midwest Pingluo County and in Xidatan town in the west. In the north of the plain, salt crusts commonly form because of shallow groundwater, poor drainage and a high evaporation ratio. In the south, the terrain is higher and drainage is unimpeded. Therefore, the degree of salinization is low, and most soils are slightly saline or non-saline.
Yinchuan Plain is an important ecological barrier in western China and a reserve of cultivated land resources, with 30% arable land. It provides 50% of the food produced in Ningxia Province. However, soil salinization in the Yinchuan Plain has become a particular inhibitor of crop growth and of the healthy development of agricultural and ecological systems [74]. Various approaches, including leaching with low-EC water, adding gypsum, and cultivating salt-tolerant plants, have therefore been attempted to alleviate the effects of salinization [75]. Nevertheless, the success of such approaches requires continuous monitoring of changes in soil EC, which can be time-consuming or infeasible for large areas. An accurate remote sensing approach to monitoring soil salinization, as suggested in this study, can overcome these problems and help avoid future salinization, especially when the cultivated land area and grain yield must be increased to guarantee national food security.

Uncertainty Analysis of Soil EC Inversion Model
The accuracy of hyperspectral-based monitoring of soil salinization is sensitive to data pre-processing, to factors that affect the soil spectrum, such as soil texture, moisture content and organic matter content, and to the selection of salinity indices and vegetation indices. Moreover, the division of the modeling and validation sets also affects model accuracy [76,77]. In this inversion study of soil electrical conductivity, the correlations were mainly between the reflectance and the degree of salinity. Although the properties of saline soils are comparatively changeable and so may affect the reflectance, such changes, their intensity, and their consequences for the reflectance also depend on the salinity levels. Our results show that the second-order derivative is the best spectral processing method and that terrain factors play a dominant role in the model, outranking the spectral variables, which provides a basis for the simulation of soil salinization at larger scales. Nevertheless, no universal spectral index or variable can give a satisfactory result under all environmental conditions. Therefore, the selection of optimal modeling parameters should be based on the regional and environmental conditions, rather than on fixed parameters.
Simple near-ground remote sensing methods alone cannot fully reflect the characteristics of salinization. Therefore, comprehensive multisource data are needed as a new way to study the complex problem of salinization monitoring, by providing more soil and training information [78]. The soil samples used in this paper are mainly concentrated in arid areas; whether the established optimal model can be applied to other areas remains to be discussed and verified. The applicability of different ensemble learning algorithms needs to be explored for arid oases and for coastal saline soils, and other types of models should be selected for comparison and optimization. Our approach can also be applied to predict other soil properties such as texture and clay mineralogy, or to determine different soil layers or soil types based on different data sets and verifications. In the future, the study range and soil sample size need further expansion to complete the soil hyperspectral database, and environmental variables should be appropriately considered in order to improve the accuracy of the model. It is also necessary to study how soil properties affect the response mechanism of the hyperspectral inversion of EC content, and to combine models and images in order to conduct large-scale mapping and monitoring of soil salinization [10].

Conclusions
To find the best hyperspectral inversion of soil electrical conductivity, this study was conducted in the northern Yinchuan Plain, China, which faces serious soil salinization. The performance of different pre-processing techniques on different models of the same data set was compared, and the extremely randomized trees (ERT) and light gradient boosting machine (LightGBM) models were studied. We used measured soil hyperspectral and salinity data to explore the feasibility of estimating EC from Vis-NIR spectral models. The correlation between reflectance and EC was analyzed using different hyperspectral pre-processing methods, and the spectral indices were calculated. Among the spectral transformations and salinity indices, the second-order derivative and SNV-NDI had the largest correlations with EC, 0.596 and 0.689, respectively. The SNV and second-order derivative pre-processing techniques strongly improved the correlations. The salinity features selected using GBM include slope, NDI, SI-T, RI, profile curvature, DOA, plane curvature, SI (conventional), elevation, Int2, aspect, S1 and TWI. Among the six EC inversion models (XGBoost, LightGBM, RF, ERT, CART and RR), the ERT model performed best. The optimal parameter combination after the grid search was as follows: the n_estimator was 21, the max_depth was 14, and R² and MSE were 0.98 and 0.37, respectively. Based on the validation of the prediction model, the machine learning model (ERT) can be applied to the prediction of EC and provides a new method for the accurate inversion of soil EC. Our study provides a reference for the inversion of soil EC and a basis for soil salinization simulation on a larger scale, which can support the sustainable development of local agriculture and protect the ecological environment in arid areas.