Integrating Remote Sensing and Landscape Characteristics to Estimate Soil Salinity Using Machine Learning Methods: A Case Study from Southern Xinjiang, China

: Soil salinization, one of the most severe global land degradation problems, leads to the loss of arable land and declines in crop yields. Monitoring the distribution of salinized soil and degree of salinization is critical for management, remediation, and utilization of salinized soil; however, there is a lack of thorough assessment of various data sources including remote sensing and landscape characteristics for estimating soil salinity in arid and semi-arid areas. The overall goal of this study was to develop a framework for estimating soil salinity in diverse landscapes by fusing information from satellite images, landscape characteristics, and appropriate machine learning models. To explore the spatial distribution of soil salinity in southern Xinjiang, China, as a case study, we obtained 151 soil samples in a ﬁeld campaign, which were analyzed in laboratory for soil electrical conductivity. A total of 35 indices including remote sensing classiﬁers (11), terrain attributes (3), vegetation spectral indices (8), and salinity spectral indices (13) were calculated or derived and correlated with soil salinity. Nine were used to model and estimate soil salinity using four predictive modelling approaches: partial least squares regression (PLSR), convolutional neural network (CNN), support vector machine (SVM) learning, and random forest (RF). Testing datasets were divided into vegetation-covered and bare soil samples and were used for accuracy assessment. The RF model was the best regression model in this study, with R 2 = 0.75, and was most e ﬀ ective in revealing the spatial characteristics of salt distribution. Importance analysis and path modeling of independent variables indicated that environmental factors and soil salinity indices including digital elevation model (DEM), B10, and green atmospherically resistant vegetation index (GARI) showed the strongest contribution in soil salinity estimation. This showed a great promise in the measurement and monitoring of soil salinity in arid and semi-arid areas from the integration of remote sensing, landscape characteristics, and using machine learning model.


Introduction
Soil salinization is a severe environmental hazard posing a considerable threat to global land degradation [1,2]. Soil salinity exerts negative impacts on ecosystem health, soil quality, and crop breeding and harvest, affecting approximately 20% of irrigated farmlands and agricultural ecosystems worldwide [3,4]. Soil salinity results from a complicated progression related to climate, groundwater, topography, and human activities [5,6]. For example, salt-induced degradation is more pronounced in semi-arid and arid regions. Human activities such as tillage in a natural environment characterized by low precipitation, high soil evaporation, and high groundwater level [7,8] make cultivated soils more vulnerable to serious salinization problems. The demand for natural resources and food from the increasing population require more land to be used for farming, including marginal, vulnerable, and already degraded lands such deserts and lands affected by salinity [9,10]. Therefore, careful monitoring, quantitative assessment and analysis, and mapping to reveal the temporal and spatial distribution of soil salinity have become pressing concerns for land management and reclamation of salinized soil [11,12].
Under complex human activities and geological cycles, soil salinization exhibits high spatial and temporal variability [2], monitoring of which can provide information for land management [12]. Traditional methods to detect and assess soil salinity require intensive regular field work and laboratory analyses, which are time-and cost-consuming [13]. Sampling in a large area is almost impractical for frequent experimentation [14,15]. To solve the problem, in situ measurements using proximal sensing electromagnetic (EM) induction instruments have been used to estimate soil apparent electrical conductivity (EC a ) and simplify field sampling work [16]. The measurement of soil salinity at individual points or at a field-scale rather than large spatial scales often limit the use of these proximal soil sensors [12,17,18]; although the estimation accuracy has been improved, it still requires a large amount of manpower. As an alternative, digital soil mapping (DSM) can be used to predict soil properties, with limited soil samples and environmental covariates derived from remote sensing and ancillary data. DSM helps in estimating the spatial distribution of soil salinity at a large scale, from either sparse or discrete samples [19,20]. Satellite remote sensing technology can provide substantial soil information, a salient advantage of broad spatial coverage and periodic measurements over traditional field surveys or even surveys with proximal soil sensing [12,21]. Salt in soil exhibits particular absorption features of the soil surface within the electromagnetic spectrum range, while non-saline soils show higher reflectance in visible and near-infrared wavelengths. This provides theoretical support for using multispectral sensors and hyperspectral data to estimate soil salinity [22,23]. Data collected from various multispectral sensors, such as the moderate resolution imaging spectroradiometer (MODIS), the Landsat series, IKONOS, the HuanJing series, and the GaoFen series have been applied to extract salinization information across the world [1].
As the process of salinization is affected by topography, vegetation cover, and soil moisture, the use of remote sensing images alone cannot indicate the distribution of salinity efficiently [24]. Surface salt can be sufficiently monitored from satellite imagery only if the soils are bare and dry. The topography and vegetation cover largely affect the migration and deposition of salt, and humid weather brings moisture to the soils and turns the surface colors darker in the imagery, which sharply decreases the accuracy of soil salinity detection [25]. Different covariates contribute differently to the formation of soil salt. Choosing the independent variables with strong explanatory properties of soil salinity to estimate soil salinity helps to improve the accuracy of the estimation and the speed of calculation. Correlation coefficients are used to measure the correlation between covariates and soil salinity, so as to evaluate whether the covariates can be used for modeling. In the process of modeling training datasets, the RF regression model can be used to obtain variable importance. The obtained RF variable importance is accuracy-based importance (MDA, mean decrease accuracy) [1]. In order to eliminate the influence of multicollinearity on interpretation, the partial least squares path modeling (PLS-PM) is used to evaluate the interaction of independent variables [26]. Evaluating the correlation between covariates to soil salinity and each other is beneficial to reveal the key indicators of soil salts formation or estimation. Various studies have successfully employed satellite remote sensing data and auxiliary data to reveal the distribution of soil salinity on the basis of the correlations between several indices derived from information on soil properties, environmental factors, spectral bands, and soil reflectance spectra at different spatial scales [16,23,27,28]. As radar has a robust capability for penetrating land cover in all weather, a new opportunity for estimating soil density may be presented by use of radar imagery, including Sentinel-1, Seasat SAR, JERS-1 SAR, and ERS-1/2 SAR [29].
Construction of models between covariates (e.g., salinity controlling factors) and dependent variables (e.g., soil salinity) can accurately estimate the distribution and extent of soil salinity. The models for estimating soil salinity can be divided into linear regression models and nonlinear regression models. Most linear models such as multiple factor regression (MRF), inverse density weighted (IDW) regression, and partial least squares regression (PLSR) have been applied to determine soil salinity [30]. However, most linear models show poor estimation accuracy in regions with high spatial variability in salinity [28]. Recently, strong momentum has been gained in soil salinity mapping using machine learning models such as support vector machine (SVM), cubist, and random forest (RF) [31,32]. The neural network (NN) models such as back propagation neural network (BPNN), multi-layer perceptron neural network (MLP-NN), and convolutional neural network (CNN) are gradually being used for salt estimation [33]. Different independent variables are used to model soil salinity, and appear in different estimation accuracies [34]. The formation of soil salinity is a complex process controlled by multiple factors. Thus, simple linear sum of effects of multiple factors may not reveal the actual situation, while nonlinear models can better fit the contributions of various factors of soil salinity.
The challenge of efficiently estimating soil salinity in regions with high spatial variability and diverse landscapes by machine learning models has recently gained attention of researchers [28,35]. A comprehensive assessment of contributions from multiple factors of soil salinization using machine learning models remains lacking. A thorough assessment of environmental covariates that are highly correlated with soil salinity should be done to quantify spatial and temporal variability of soil salinity in conjunction with adaptable machine learning models and appropriate remote sensing imageries. The overall goal of this research was to use machine learning methods to develop a framework in estimating soil salinity in diverse landscapes by fusing information from satellite images and landscape characteristics. Specifically, this research aimed to (a) explore suitable covariates to reduce interference by soil moisture, vegetation, and other factors on the remote sensing estimation of soil salinization; (b) estimate soil salinity by employing partial least squares regression (PLSR), convolutional neural network (CNN), support vector machine (SVM), and random forest (RF) models using factors derived from remote sensing imagery and landscape characteristics; (c) quantitatively estimate and map the soil salinization in diverse landscapes of southern Xinjiang, China, as a case study area with high accuracy; and (d) analyze and identify the important factors. For areas with diverse and inaccessible landscapes, these will provide a framework to improve soil salinity monitoring and mapping, which could improve land management and planning and assessment of reclamation activities.

Study Area
The study area is in the center of Aksu district, southern Xinjiang Province, northwestern China (40 • 41 -41 • 20 N, 80 • 42 -81 • 20 E) ( Figure 1). It extends from east to west for 70 km and from north to south for 92 km. There is a north-south provincial highway (S215) running through the entire study area. The study area is a typical alluvial fan from northwest to southeast and experiences long daylight hours with sufficient solar thermal resources. The climate is a typical continental warm temperate arid climate, with a low average annual rainfall of 60 mm and a high average annual evaporation of 2000 mm [16]. The complexity of landform, arid climatic conditions, intense evapotranspiration, and high level of underground water contribute to the accumulation of soil salinity [36]. The principal type of land use is desert in the center, and recently cultivated land in the northwest and south of the study area. Despite the soil salinization in desert areas, there are still some halophytes growing in the northwest of the central part, including Tamarix, Halocnemum strobilaceum, Halostachys caspica, reeds, and Alhagi sparsifolia Shap. [28]. The phenomenon of salt accumulation in the surface layer of the soil is particularly severe in the dry season.

Soil Sampling and Laboratory Analysis
Soil samples were collected on 27 July 2017 along the S215 provincial highway ( Figure 1). Along the sides of the road, sample plots of uniform landform of 30 × 30 m were chosen. The topographic features of the study area were uniform and flat, and the Quincunx sampling method [37] was used to collect 5 topsoil samples (0-20 cm). The Trimble Juno SB handheld GPS was used to record the latitude and longitude of each sampling point. Trimble Juno SB handheld GPS used differential positioning to characterize the geographic location of the point with accuracy of 2-5 m, which can be used for the scale of the study area. To prevent moisture evaporation, we placed samples in sealed plastic bags. Geolocations of the sampling points were imported and overlaid on the synchronized remote sensing image in ArcGIS to check the distribution. A total of 151 sampling points were sampled for further analysis. Soil samples from the same pixel were fully mixed and homogenized, and then they were subsampled to retain 300 g of soil for further laboratory analysis. Visually observable stones and leaves were first removed and then soil samples were air-dried, ground, and sieved to 2 mm size. Considering that the soil conductivity in the study area is relatively high, the soil leachate was obtained with a 1:5 soil/water ratio using a LeiCi DDS-307 (ShengKe, Shanghai, China) conductivity meter to measure the electrical conductivity.

Satellite Imagery and Preprocessing
Satellite imagery from Landsat-8 and Sentinel-1 were downloaded for the same date of field sampling. Landsat-8 was launched by the Landsat program on 11 February 2013 and comprises an operational land imager (OLI) and a thermal infrared sensor (TIRS). These two sensors monitor 11 bands (Table 1) [38]. The OLI provides 8 bands at a spatial resolution of 30 m and a panchromatic band at 15 m, and the TIRS provides 2 thermal infrared bands at a spatial resolution of 100 m. The Sentinel-1 satellite is an earth observation satellite from the European Space Agency's Copernicus Project (GMES), and it was launched on 3 April 2014. It carries a C-band synthetic aperture radar, which can provide all-day images in various weather conditions. Sentinel-1A data was acquired in the interferometric wide (IW) mode C-band, with dual polarization vertical transmit vertical receive (VV)/vertical transmit horizontal receive (VH) at a spatial resolution of 20 m [39,40]. The orbit number of Sentinel-1A image covering the study area was 17635 (Table 2) [41,42].
There were a few clouds in the middle of the Landsat-8 image, and the "Fmask" tool was used to remove clouds [36]. To characterize landforms, we applied radiation correction and atmospheric correction using Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH). After applying orbital correction, thermal noise removal, radiation correction, Lee filtering, and range Doppler terrain correction to Sentinel-1 images, we converted radar backscattering coefficients of the two VV and VH bands from digital number (DN) to a decibel (dB) [32]. In addition, both Landsat-8 data and Sentinel-1 data were resampled to the same spatial resolution of 30 m using resample function in ArcGIS. After preprocessing, single bands were used as independent variables, or bandmath function was used to calculate vegetation spectral indices, salinity spectral indices, and backscattering coefficients from original bands.

Auxiliary Data
In most cases, single-band data and auxiliary data such as vegetation cover and landform characteristics or topographic factors were coupled to characterize the distribution of soil salinity [24,43]. Auxiliary data included remote sensing-based auxiliary data, radar-based auxiliary data, and DEM-based auxiliary data (Table 3), and was used for estimation. Remote sensing-based auxiliary data included vegetation spectral indices and salinity spectral indices. Halophytes can grow in saline soil in arid areas; thus, vegetation indices can also be used to characterize soil salinity. Previous studies used vegetation indices including normalized difference vegetation index (NDVI), generalized difference vegetation index (GDVI), green atmospherically resistant vegetation index (GARI), extended NDVI (ENDVI), and enhanced vegetation index (EVI) to indicate soil salinity by monitoring the halophytic properties of plants, and salinity indices including salinity indexes can be used to directly indicate the content of soil salts [28,44]. Radar-based auxiliary data consist of backscattering coefficients after removing the influence of vegetation [32,45]. To reveal the impact of vegetation on the accuracy of estimation, we calculated vegetation coverage (VFC) for the study area. VFC was used to identify vegetation covered (VFC > 45%) and bare soil pixels (VFC < 45%). A water cloud model was applied to remove the effect of vegetation water content (VWC) on backscattering coefficients [12,33]. A digital elevation model (DEM) at a spatial resolution of 30 m with terrain attributes including elevation, slope, and aspect derived from the DEM in ArcGIS were regarded as DEM-based auxiliary data. DEM data were obtained from the Shuttle Radar Topography Mission (SRTM; https://www2.jpl.nasa.gov/srtm/). In the study area, the DEM of each pixel was available, and the Landsat-8, sentinel-1A, and DEM data were all resampled to a spatial resolution of 30 m. Table 3. Thirty-five indices including auxiliary data (remote sensing data, terrain attributes, vegetation spectral indices, salinity spectral indices, and radar data) used to estimate soil EC in this study along with their abbreviations, calculation formulae, and the reference of calculation.

Auxiliary Data Land Surface Parameters Abbreviation Formulations References
Remote sensing data Terrain attributes

Modeling Strategies
In this study, the total soil samples were divided into 2 parts: 103 samples for training and 48 samples for testing. The soil samples were sorted from low to high according to the measured EC values, and 1 in every 3 samples were randomly selected to comprise the testing datasets. Training datasets and testing datasets were independent from each other. To choose the variables sharing significant influence on soil EC and to improve the efficiency of estimation, we applied factor correlation and significance analysis to filter independent variables. Suitable covariates were then employed to quantitatively reveal the spatial distribution of soil salinity. The PLSR, CNN, SVM, and RF models were used in this study.
PLSR is a typical linear regression model that integrates the advantages of principal component analysis into regression. It shows superiority in a situation in which the number of variables is very large with strong collinearity and noise [28]. Among previous investigations, PLSR has proven to be the most widely used linear regression model for estimating soil attributes [33,55]. The partial least squares regression model firstly extracts a principal component from the independent variable matrix and the dependent variable matrix. The principal components need to contain the variation information in their respective matrices as much as possible and maximize the correlation between the dependent variable components and the independent variable components. After the first component extraction, the regression of the 2 components to their respective source matrices was established, and the estimation accuracy was used to evaluate the results. If not, the residuals of the two variable matrices after the previous regression were used for another component extraction and regression until a satisfied accuracy was obtained [19]. Assuming a total of k rounds of component extraction and regression process, we finally obtained a regression model composed of k independent variable components [13,55]. In the modeling, the number of component extraction and regression processes ranged from 1 to 10, and the regression model with 6 extractions showed the best performance. Support vector machine regression is a supervised machine learning method developed on the basis of statistical learning theory and can avoid overfitting [23,34]. SVM shows high estimation accuracy when modeling variables without collinearity. The principle of modeling is to find hyperplanes in high-dimensional space. It uses a nonlinear mapping algorithm to convert linearly inseparable samples into high-dimensional features using the principle of structural risk minimization (SRM) based on the Vapnik-Chervonenkis (VC) dimension in order to make them linearly separable [34,56]. For nonlinear models, the support vector machine regression model enables high-dimensional feature spaces to use linear algorithms to perform linear analysis on the nonlinear features of samples [23]. In SVM modeling, radial basis function was used as the kernel function. The cost range was set from 0.0001 to 0.1, and the gamma range was set from 1 to 1000. The best result was obtained when the cost was 10 and the gamma was 0.01.
RF is a flexible ensemble-learning algorithm based on decision trees. The basic idea of RF is to generate multiple independent decision trees using random samples and finally to yield one single estimation determined by each decision tree in the forest [16]. RF has an advantage of reducing the risk of overfitting by taking the average of decision trees. In addition, RF is relatively stable when faced with extreme values because they only affect one decision tree and are unlikely to affect the result. The parameters of the model were the number of trees (ntree) and the number of variables selected when each node is split (mtry). In order to obtain the best model, we selected the root mean square error of the corresponding model of different mtry, and then selected the mtry of the model that can obtain the smallest error value as the optimal number of variables. The parameters of the model were the number of trees (ntree) and the number of variables selected when splitting each node (mtry) [22,33]. In RF modeling, the number of trees is 500. In order to obtain the best model, we used mtry ranging from 1 to 20 to loop with the minimum root mean squared error used as the criterion. The best result was obtained when the number of variables at each split was 9, the node size was 12, and 10 features were randomly extracted at each node of each tree.
CNN is an efficient deep learning model widely used in computer vision and image classification [57]. CNN is developed on the basis of a neural network (NN) algorithm and features feedforward performance with deep structure. In recent research, NN has been efficiently applied to estimate soil salinity using satellite images, and CNN is also used for estimation [23]. It simulates the process of biological visual perception mechanism and can perform supervised learning and unsupervised learning. A CNN regression model typically consists of an input layer, a set of successive hidden layers including a convolutional layer, a pooling layer, and an output layer [58]. During the processing of activation function and the convolution kernel, the feature vectors of the input layer are calculated several times in the hidden layer, which is used to fit the regression relationship and display the estimation in the output layer. The procedure performs as a fully connected multilayer perceptron. The CNN model was framed with the tensorflow module. In the CNN modeling procedure, the one-dimensional data of 9 attributes was calculated into a 3 × 3 two-dimensional matrix; firstly, there were 4 hidden layers, and a 2 × 2 convolution kernel with ReLU as an activation function was used in the convolution layer. In each convolution, through the calculation of the 2 × 2 convolution kernel, each pixel was calculated to twice the original. Considering that there were only 9 soil properties used, the input matrix size was not large, the dimension of the input data was 9, and the dimension of the output data was 1. After 4 calculations, each pixel thickened to 256 pixels, and 3 × 3 × 256 neurons were obtained, which produced three-dimensional data. Then, two fully connected layers were used to reduce the dimensionality of the data to one dimension to achieve regression. To prevent overfitting, the learning rate was initiated at 0.000005 and reduced to 0.98 of the former figure every 20,000 times. The probability parameter of the drop-out layer was 0.75, and a total of 200,000 iterations were performed. In addition, the GradientDescentOptimizer function was used in Tensorflow to achieve gradient descent. Momentum was not used, but the attenuation of the learning rate can also help improve training efficiency and accelerate to convergence.

Partial Least Squares Path Modeling
In the interpretation of the characteristics of the independent variables, multicollinearity of them will affect the importance of modeling, and thus partial least squares path modeling (PLS-PM) can be used to assess the interaction of independent variables [26]. A path analysis model can decompose the influence of independent variables on dependent variables into direct and indirect impacts, making the causality of variables more specific. The model of partial least squares (PLS) path modeling (PM) is a variance-based structural equation modeling (SEM) technique that is widely used to explain the connections between variables [59,60]. It can model causal paths between latent variables (LV) to explain an inner model, and the measured variables (MV) to explain the outer model [61]. In the outer model, the influence between LV and MV are quantified by weights (λ). In the inner model, the links between LV are quantified by path coefficients (β) to explain the connection coefficient (r) of independent variables and dependent variables [59].

Accuracy Assessment and Uncertainty Assessment
To evaluate the accuracy of statistical models for soil salt estimation, we adopted several accuracy indicators including R 2 (Equation (1)) and root mean squared error (RMSE) (Equation (2)) were adopted [21,62].
where n is the number of samples, Y testing,i is the ith measured EC, and Y model,i is the ith simulated EC by each model. To reduce the specificity of the particular training set selection, we used 50× non-parametric bootstrapping to measure the uncertainty associated with the estimation. Bootstrapping was performed with the bagging method to randomly extract data from the training dataset [13]. After each sample was collected, the sample was replaced, while un-extracted data were not used in the modeling set and were considered out-of-bag (OOB) [63]. In the training process of modeling, we used a fivefold cross-validation method to obtain the best parameters of the model with the minimum of root mean squared error (RMSE). For each testing process and each grid point, we constructed 50× PLSR, CNN, SVM, and RF models to calculate the estimated EC. Then, the average of 50 soil EC values from each regression model was obtained as the final result.
We used 90% confidence intervals (CIs) (Equation (3)) to indicate that the true value of soil EC value has 90% possibility within the interval between upper and lower CIs limits [64]. Moreover, the uncertainty (Equation (4)) was used to estimate the prediction.
where Y is the average soil EC value of the 50× estimations. The number of a was 1.676 when the number of repetitions is 50, and the confidence interval is 90%. SD is the standard deviation of estimations. CI upper and CI lower are the lower and upper bounds of CIs. Figure 2 shows the spatial distribution and violin plot of training and testing datasets. Randomly selected testing data were distributed at various locations of the sampling points in the study area. Table 4 presents descriptive statistics of EC for training datasets; testing datasets; and all samples under bare soil, vegetation cover, and the entire sampling group. The descriptive statistics of measured EC value of training datasets and the testing datasets were as close as possible so that the modeling could be applied to as much testing data as possible. There was a high variation in EC from 2.09 dS m −1 to 46.70 dS m −1 , both extreme values found in bare soil. The average EC of all samples was 19.45 dS m −1 , and the median was 16.67 dS m −1 . The standard deviation was 9.98, the kurtosis was 0.39, the skewness was 0.94, and the CV was 51%, indicating variability in measured EC that showed near normal distribution. Vegetation mostly covered sampling points in the northwest, and the ratio of vegetation-covered soil to bare soil across the sampling points was approximately 1:2. Overall, the measured EC values were used as a grading standard of soil salinity according to the natural soil salinization classification standard: non-saline soil, EC ≤ 7.5 dS m −1 ; lightly salinized soil, 7.5 dS m −1 < EC ≤ 15 dS m −1 ; moderately salinized soil, 15 dS m −1 < EC ≤ 30 dS m −1 ; severe salinized soil, 30 dS m −1 < EC ≤ 60 dS m −1 ; and saline soil, EC > 60 dS m −1 [65,66]. The EC values of soil samples were mainly in the moderate and severe salinity categories. However, the coefficient of variation of 51% indicated a high variation in the measured soil EC. Spatially, severe salinity was mainly observed in the middle of the study area, mostly in desert and a few areas covered with vegetation.

Evaluation of the Accuracy of Estimations
After selecting independent variables, we developed predictive models using PLSR, CNN, SVM, and RF to estimate soil salinity. The bootstrap sampling method was used 50 times to calculate the uncertainty, and a fivefold cross-validation method was used to obtain the best parameters of the model with the minimum of RMSE. During the testing process, the simulated soil EC was evaluated against the measured EC (Table 6). After parameter optimization and bootstrapping on the training datasets for each model, we found the average estimation to be R 2 = 0.52 for the PLSR model, R 2 = 0.53 for the CNN model, R 2 = 0.68 for the SVM model, and R 2 = 0.75 for the RF model. The accuracy of bootstrap, vegetation-covered dataset, and bare soil dataset is shown in Table 6. Comparing the accuracy of estimation among vegetation-covered, bare soil, and total samples, we found the results from vegetation-covered areas to show higher accuracy but greater dispersion than the testing datasets of total samples for PLSR, SVM, and RF models. The accuracy of bootstrap for each model was similar to the accuracy of the testing datasets, while the RF model was lower. Figure 3 shows the uncertainty value of each testing point in order to assess the uncertainty of modeling. The mean uncertainty value of CNN model was 0.03, which was the lowest when compared to PLSR (0.06), RF (0.06), and SVM (0.08). Figure 4 shows the scatterplots of measured and estimated EC values of the testing data. For PLSR, SVM, and RF models, the estimated EC values were close to the measured EC values when the soil salinity was lower than 25 dS m −1 . However, the estimated EC values were underestimated for high measured EC values. For CNN models, the estimated and measured values were distributed on both sides of the 1:1 line of the scatterplots with higher dispersion. Overall, modeling with RF resulted in the highest R 2 and the lowest MAE, while the PLSR model yielded the lowest R 2 and highest MAE. Thus, by comparing four models above, we concluded that the RF model is the best regression model. Table 6. Accuracy comparison of partial least squares regression (PLSR), convolutional neural network (CNN), support vector machine (SVM), and random forest (RF) models.

Mapping of EC and Its Uncertainty
Finally, the RF model was employed to construct an EC map of the study area; after applying bootstrap on each pixel, we obtained 50 maps of soil EC. The average of the 50 maps was shown as the final map of the study area (Figure 5a). The estimated EC values varied from 7.19 to 38.3 dS m −1 . The northwest and the south of the study area exhibited relatively low soil salinity content. Generally, farmlands are located in these areas, and due to the farmland improvement policies and measurements, the EC values were low to make it suitable for growing crops. Figure 1 shows that the measured EC values of the soil samples in the southern part of the study area were lower than 15 dS m −1 ; thus, the soil EC values calculated by the RF model were similar to the measured result. However, in the south of the study area, the EC values of some farmlands were within the moderately salinized soil class (15 dS m −1 < EC ≤ 30 dS m −1 ). A high amount of salt was mainly concentrated in the northwest to the middle of the study area covering desert areas. Further, the EC values decreased from these areas to the south of the study area. From northwest to southeast of the study area, the EC values changed from low to high and then low again. Regardless of the existence of crops in the farmland, and comparing EC values to the distribution of halophytes, the highest estimated EC value appeared in the area where halophytes grew. In addition, the EC values of some vegetation-covered areas were in the moderately and severely salinized soil categories (30 dS m −1 < EC ≤ 60 dS m −1 ), while in general, the EC values of the vegetation-covered areas were lower.
The spatial distribution of the uncertainty value of the estimation is shown in Figure 5b. Among all pixels in the study area, the uncertainty value ranged from 0.01 to 0.27. The uncertainty values were low (< 0.15) in the northeast to the middle of the study area because most of the sampling points were distributed there, while they were relatively higher (> 0.15) in northwest and south of the study area because of the low density of sampling points. As Figures 3 and 4 show, for the data with lower (< 10 dS m −1 ) and higher (> 40 dS m −1 ) soil EC value, the deviation between the estimated value and the measured value was larger, and the uncertainty was stronger. Therefore, as shown in Figure 5, the estimated value of soil salinity in the farmland in the northwest and south of the study area was lower, and the uncertainty value was higher.

Estimation Capabilities of Different Models
For uncertainty assessment (Figure 3), four models showed relatively high uncertainty in data where EC value was lower than 10 dS m −1 or higher than 40 dS m −1 . Most soil EC values of data were in the range of 10 dS m −1 to 25 dS m −1 , and thus the estimation in this interval was more accurate when modeling. The plot of mean uncertainty showed that the linear model (PLSR) had the highest uncertainty, followed by classical machine learning model (SVM) and tree-based machine learning model (RF), while the neural network model (CNN) had the lowest uncertainty. According to the estimation accuracy of the four models on the testing datasets (Table 6), the estimation capabilities were ranked as follows: RF regression > SVM regression > CNN regression > PLSR.
In general, the estimation accuracy of the nonlinear model was better than that of the linear model. In regions with high spatial variability in salinity, we found that social management, environmental factors, and geological conditions synergistically affect the accumulation of salt on the soil surface. Linear regression cannot accurately simulate this complex process, and thus the estimation ability of linear models was poor [28]. However, in the model building process, nonlinear regression showed the algorithm's ability to incorporate a large number of disparate predictors [14]. For the CNN model, it needed a large number of samples for training in the modeling process to obtain the best simulation. Due to the lack of training data, the training times cannot be set too much to prevent overfitting. Therefore, the CNN model did not show its advantages in the training of small samples [34]. Similar to other neural network models (e.g., MLP-NN) used in the latest research, NN methods were less flexible in terms of the parameterization and computational efficiency compared to RF model [34].
Due to the classification framework of decision trees, the RF model performed with the best estimation capabilities among the four models. In areas with large soil salt heterogeneity, each regression tree in the random forest was classified and regressed, which fully reflected the difference in salt distribution. For the RF model, the input of all independent variables and automatic screening by the machine was beneficial to regression, and all the independent variables (38) were put into the model for test. However, the results showed the best accuracy of R 2 was 0.46, which was much lower than the accuracy after filtering the independent variables. Thus, the RF model was regarded as the best model with nine indicators including B6, B7, B10, B11, S6, GARI, GDVI, EVI, and DEM. In building tree-structured regression models, different observations of variable selection biases towards variates with splits can seriously affect the results [17,22].

Sensitivity of Water and Vegetation Coverage
It can be seen from the modeling procedure that the contributions of different independent variables on estimating soil salinity were different. For example, the movement and accumulation of salts on the soil surface is controlled by ecological, geomorphic, climatic, and hydrologic factors. These factors affect the soil-water balance [25,68]. The water resources in the study area came from precipitation, irrigation, and rivers, and the sampling time in July was the wet season in Xinjiang. There are three weather stations around the study area: Aksu station in the northwest, Avati station in the southwest, and Alar station in the southeast. Intermittent precipitation was recorded from 22 July to 24 July in accordance with the hourly observation data of these weather stations. A precipitation of 0.1-1.6 mm per hour, which is considered light rain, was recorded within this time. Moreover, precipitation as high as 10.4 mm per hour occurred in Aksu station on 25 July. Due to the precipitation right before the sampling work, the surface salt might have leached into the deep soil by the rainwater and reduced salt content at the surface layer in comparison to the previous study [28]. The soil backscattering coefficient is affected by vegetation, surface roughness, and radar system parameters, and the vegetation scattering theory only shows its superiority in areas with homogeneous vegetation [69]. For arid and semiarid areas where vegetation cover is only partial, the contribution of soil to backscatter is generally smaller than that of the vegetation [70,71]. Due to the precipitation and the growth of halophytes, the soil moisture in the study area was generally higher than usual in the short term, and the variability was not significant in these places. The soil backscattering coefficient of Sentinel-1 was thus not highly relevant for modeling (Table 5), and the main factor affecting the soil-water balance was the presence of vegetation in this study [23,44].
For the testing datasets, the accuracy was intermediate between the vegetation-covered and the bare soil datasets for PLSR, SVM, and RF models (Table 6). Vegetation showed a strong negative correlation with soil salinity. In areas with high vegetation coverage, the use of vegetation spectral indices as independent variables improved estimation accuracy efficiently. For bare soil areas and areas with sparse vegetation coverage, the growth of vegetation and soil moisture had little effect on the formation of salts, resulting in poor estimation accuracy [16,28]. For the CNN model, repeated convolution on the dataset was performed with small samples, weakening the effect of a single independent variable on the dependent variable. Therefore, the sensitivity of EC to moisture and vegetation was not significant, and the dispersion of the estimated values was uniform (Figure 4b).
As a result, in the dry season, the spectral indices, vegetation spectral indices, and soil moisture data can be used to characterize and distinguish saline and non-saline soils. In the wet season, however, precipitation affects the water-salt balance and causes the salt on the soil surface to penetrate into the deep soil in a short period of time. This leads to abnormal effects of soil moisture and vegetation on salinity estimation. Therefore, the dry season is more suitable for estimation of soil salinity than the wet season [11,31,50,72].

Interaction of the Independent Variables
Studying the contribution of independent variables to dependent variables and the interrelationship of independent variables can obtain the strongest factors affecting soil salinization. The correlation can provide a theoretical knowledge for subsequent research and improvement of the indices that characterizes soil salinity, which helps in soil salinity estimation, saline soil management, and farmland reclamation. To quantify the importance of the nine independent variables in RF modeling process, we used importance analysis and path modeling. We calculated 50× RF modeling, with the averaged importance of indicators being shown in Figure 6. DEM exhibited the highest importance of 0.45 for modeling, followed by B10 with modeling importance of 0.14 and GARI with modeling importance of 0.14. Independent variables were grouped according to their attributes. Terrain attributes exhibited the highest importance of 0.45. The remote sensing data had the second largest contribution with importance of 0.32, followed by vegetation spectral indices with importance of 0.21. The salinity spectral indices exhibited the least modeling importance of 0.089.
In order to eliminate the influence of multiple collinearity among the nine independent variables on the evaluation of the independent variable interaction, we grouped the independent variables into partial least squares regression path modeling [26]. The framework of path analysis was set first by determination of remote sensing band data on auxiliary data and then the inhibition by terrain attributes and salinity indices of plant growth [46]. The PLS-PM was then used to explain the interactions between groups of independent variables (Figure 7). It can evaluate the fitness and the predictive ability of the model [73]. The vegetation spectral indices and salinity spectral indices were calculated from the raw remotely sensed data, and thus the remote sensing data directly contributed to these two independent variable groups [74]. Terrain attributes with r = 0.44 contributed the most to estimated EC among the four independent variable groups ( Figure 6) [75]. Terrain attributes were related to vegetation with r = 0.70 and with salinity spectral indices with r = 0.32. In the estimation of soil salinity, only the vegetation spectral indices had a negative influence. The roots of plants can absorb salt, and thus the salt content of the soil where vegetation grows was relatively low, with saline soil also inhibiting the growth of vegetation. The interaction of the three groups showed that DEM greatly affected the distribution of vegetation and the accumulation of salt, being a key factor in the estimation of salt in this study. In southern Xinjiang, DEM and vegetation coverage can be evaluated as important covariates to improve the equation that characterize soil salinity [76]. The relevance ranking of the four independent variable groups matched the modeling importance of the RF model, indicating that the RF model exhibited accurate modeling of independent variables.

Effects of Human Activities on Soil Salinization
Figures 6 and 7 indicate that DEM was the factor with the greatest positive correlation to estimate soil salinity. In previous studies, population, cultivated area, and number of livestock were the main factors for differences in salinization in a basin [77]. In general, humans prefer to settle in low-lying areas near water. For example, a town is distributed along the Tarim River in the southern part of the study area. The farmlands in this part of the study area were reclaimed to make it more suitable for human life, leading to higher salt content in the soil outside the oasis regions [78]. The topography of the alluvial fan in oasis regions affected the distribution of soil salinity, and the water and soil nutrients accumulated in the middle and lower reaches. In addition, there was a large amount of artificially reclaimed farmland in the south of the study area. Therefore, vegetation grew in these areas and the soil salt content was reduced. During precipitation and irrigation, water accumulates in areas with lower elevation, and the groundwater level rises and brings salt to the surface of the soil. Thus, if there were no interference from human activities, the soil EC should show a low value in the northwest of the study area, the upper part of the alluvial fan [28]. Due to the reclamation in the southern portion of the study area, the soil salinity content decreased significantly compared with levels before the reclamation, which changed the original spatial distribution of soil salinity [16,28]. Soil EC showed lower values in the lower terrain, and there was a strong positive correlation between elevation and soil salinity (Table 5 and Figure 6). The farmland in the northwest was well cultivated, and the soil EC was relatively low compared to the soil around this area. The estimated soil EC value of the farmland in the northwest was found to be under 15 dS m −1 , while the estimated soil EC value of the surroundings was above 30 dS m −1 . Most of the farmland in the south of the study area was newly cultivated since 2010 but had not been completely improved, and thus the salt content was still higher than in the farmland in the northwest. This result shows farmland reclamation can partially de-salinize the soil with high salt content; therefore, the salinity of the top soil will drop sharply, making it suitable for crop planting.

Conclusions
In this study, nine independent variables involving four categories (remote sensing data, terrain attributes, salinity spectral indices, and vegetation spectral indices) and four models (PLSR, CNN, SVM, and RF) were selected to estimate soil salinity in Xinjiang. The main results are as follows: 1.
Overall, among the 35 factors considered, 9 indices (B6, B7, B10, B11, S6, GARI, GDVI, EVI, and DEM) were significant and contributed to the estimation of soil salinity. The random forest regression model was the best model in this study, with better model performance and accuracy measures (R 2 = 0.75, RMSE = 7.33 dS m −1 ).

2.
According to the EC map derived from the random forest model, the northwest and the south of the study area showed relatively low soil salinity. Salt was mainly concentrated in the northwest to the middle of the study area covering the desert.

3.
Farmland reclamation affected the distribution of soil salinity, resulting in a strong positive correlation between elevation and soil EC. Moreover, the variability of soil salinity was sensitive to moisture and vegetation. Dry season was more conducive for the estimation of soil salinity.