Soil Salinity Assessment and Cross-Regional Validation Based on Multiple Feature Optimization Methods and SHAP

Shi, Shuaishuai; Wang, Yu; Wang, Jiawen; Yang, Jibang; Bai, Zijin; Peng, Jie

doi:10.3390/rs18060955

by Shuaishuai Shi ¹,†, Yu Wang ²,†, Jiawen Wang ², Jibang Yang ¹, Zijin Bai ³,* and Jie Peng ¹,⁴

¹ College of Agriculture, Tarim University, Alar 843300, China
² College of Life Sciences and Technology, Tarim University, Alar 843300, China
³ College of Horticulture and Forestry, Tarim University, Alar 843300, China
⁴ Key Laboratory of Genetic Improvement and Efficient Production for Specialty Crops in Arid Southern Xinjiang of Xinjiang Corps, Alar 843300, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Remote Sens. 2026, 18(6), 955; https://doi.org/10.3390/rs18060955

Submission received: 9 January 2026 / Revised: 14 March 2026 / Accepted: 19 March 2026 / Published: 23 March 2026

(This article belongs to the Special Issue Environmental Monitoring Based on Remote Sensing, Earth Observation and Geoinformation)

Highlights

What are the main findings?

Feature number impacts soil salinity evaluation accuracy.
Feature selection enhances salinity evaluation accuracy at a regional scale.
SHAP analysis identifies CRSI, BI, and MSAVI2 as the most influential predictors for soil salinity in the study area.
A set of salinity data for southern Xinjiang is presented for the first time.

What are the implications of the main findings?

This study confirms that optimization improves the accuracy and transferability of multi-source remote sensing-based soil salinity inversion models.
SHAP values explain feature selection and identify key features for regional salinity estimation.
Integrating multi-source remote sensing data, feature selection, SHAP and RF models enables high-precision and rapid online salinity mapping, facilitating subsequent applications.

Abstract

Soil salinity severely threatens global ecosystems and agriculture, making accurate monitoring an ongoing priority. Currently, efficiently utilizing multi-source datasets to enhance monitoring accuracy while minimizing computational resources remains a critical challenge. This study evaluated several modeling strategies, including full-dataset modeling, variance inflation factor (VIF), Boruta, particle swarm optimization, ant colony optimization and recursive feature elimination (RFE), and validated results across diverse regions (Almaty, Kazakhstan; Shandong, China). We further validated the results using multiple algorithms, including linear regression, partial least squares regression, extreme gradient boosting, k-nearest neighbor and random forest (RF), with topsoil (0–20 cm) electrical conductivity inverted via the optimal method. Results indicate that input feature numbers substantially impact model performance: regional-scale feature selection is indispensable, with RFE outperforming full-dataset modeling (R² improves by up to 0.28, while RMSE decreases by 2.21 dS m⁻¹) and VIF performing the worst. Transferability is also demonstrated in Almaty and Shandong. Additionally, the RF algorithm shows superior performance in soil salinity mapping (overall accuracy = 0.73; kappa coefficient = 0.65). Moreover, the RFE and SHAP results highlight CRSI, BI, and MSAVI2 as particularly important predictors for estimating soil salinity in our study area. Collectively, this study highlights the critical importance of feature optimization and interpretability in soil attribute mapping through the integration of multi-source remote sensing data.

Keywords:

soil salinity; multi-source; remote sensing; Google Earth Engine; feature selection

1. Introduction

Excess soluble salt causes soil salinization, a significant agricultural issue [1,2,3] that leads to land degradation [4,5], reduced agricultural productivity [6], water pollution [7], and ecological damage [8]. Soil salinization is a common land degradation problem globally [9]. Many countries, including Argentina, China, India, Pakistan, Sudan, and the USA, are concerned about the origin and distribution of soil salinization [10,11]. The outlook is not optimistic. According to recent statistics, there are approximately 1381 million ha of salt-affected soil worldwide, and about 10.7% of soil is impacted by salt [12]. Moreover, the area of saline soil increases by 10% annually for various reasons, including low precipitation, high surface evaporation, saline irrigation, and poor cultural practices [13], especially in dryland regions [14]. In Xinjiang, China, soil salinization is particularly pronounced due to the arid climate and unique geographic location [15]. The resulting land degradation and scarcity of high-quality water resources have become major limitations to the stability and sustainable development of oasis ecosystems [16,17]. Therefore, characterizing the spatial distribution of soil salinity is of strategic significance for effective soil and water resources management and for promoting sustainable regional development [18].

Remote sensing (RS) has been widely utilized in soil salinity monitoring, and combining it with field electrical conductivity (EC) data can accurately map large-scale soil salinity [19,20,21]. Among the various satellite platforms, the Sentinel series has recently emerged as a reliable tool for estimating soil salinization [22]. Sentinel-2 multispectral images can be used to construct multiple RS indices and have demonstrated strong applicability in soil salinity prediction [23]. However, their performance is susceptible to cloud cover [24]. Comparatively, Sentinel-1 synthetic aperture radar (SAR) images offer greater penetration capability under all-weather conditions and provide more robust surface backscatter data [25]. Sentinel-1/2 are freely available and provide high-temporal-resolution time-series data, and the integration of the two can substantially enhance the accuracy of soil salinity assessment [18]. Recently, multi-source RS monitoring based on Sentinel-1/2 has become a research hotspot [25,26,27]. Leveraging the complementary strengths of multispectral and SAR imagery yields more detailed information about surface features and improves the results of practical applications [27]. Li et al. [26] found their validation set R² values were 0.59 and 0.18 when using Sentinel-1/2 and DEM individually but increased to 0.66 when integrating them. Similarly, Ma et al. [25] showed that combining Sentinel-1/2 with DEM improved validation accuracy by 0.09 and reduced the RMSE by 1.43 dS m⁻¹, compared with using Sentinel-2 and DEM data only. These findings highlight that integrating Sentinel-1/2 with DEM can markedly improve the monitoring accuracy of soil salinity.

Integrating Sentinel-1/2 and DEM enhances model prediction accuracy while increasing the number of features. However, those additional features may be redundant: some of them contribute similarly or contain irrelevant data uncorrelated with the target feature [28]. For instance, Li et al. [26] obtained 47 features from Sentinel-1/2 and DEM but ultimately only used the top 10 vital features for mapping, according to a random forest (RF) model. The RF model ranks RS features by the variance reduction they contribute across its trees, which makes it comparatively robust to multicollinearity. Apart from the few models that can rank features directly, feature filtering mainly relies on feature selection algorithms. For example, Pratama et al. [29] and Das et al. [30] used recursive feature elimination (RFE), and Boruta was used by Zhang et al. [31]. Other researchers simply excluded RS features with absolute correlation coefficients exceeding 0.9 [32,33,34,35,36]. However, a few studies found that filtering features did not necessarily improve model performance. Taghadosi et al. [28], for example, compared sequential feature selection and a genetic algorithm after artificially removing 12 features and found that the prediction set’s R² decreased across different support vector machine (SVM) kernels relative to using all features. Thus, some studies chose to disregard the effects of feature selection in their modeling workflows [37,38]. Thus far, there is no systematic method for monitoring soil salinization, which severely hinders the development of practical, systematic applications. Bouasria et al. [39] and Khaire and Dhanalakshmi [40] emphasized in their studies that machine learning outcomes are influenced by the spatial range of the study area. Depending on the size of the sampling area, the final evaluation accuracy may differ, even when using the same methods.
Although several studies have compared the outcomes of different feature selection methods [41,42], systematic analysis and interpretation are still lacking. Most current research overlooks the scale effects and transferability of data, limiting findings to specific contexts. Thus, whether feature selection is needed for RS data at varying scales, and which methods to choose, remains to be explored in depth.

Moreover, for practical applications, quickly and accurately assessing soil salinity is a key concern [37]. The Google Earth Engine (GEE) platform greatly facilitates remote sensing monitoring of soil properties [43]. On the one hand, users can directly perform image processing and feature extraction on the GEE platform, greatly reducing the time required for image downloading. On the other hand, GEE integrates some commonly used algorithms and models, allowing for direct application of RF and others. Additionally, GEE training and mapping are faster and more convenient, making it highly suitable for practical applications.

This study investigates the role of feature engineering in regional-scale data and explores how to utilize multi-source RS data to map soil salinity distribution quickly and accurately, providing a practical method for future applications. The specific objectives are as follows: (1) determine the effects of several mainstream feature optimization methods in soil salinity assessment; (2) visualize the modeling process using the LR model and SHAP analysis; (3) validate the robustness of the results using different modeling approaches and datasets; (4) invert regional-scale EC using the optimal method on GEE.

2. Materials and Methods

This study integrates multi-source RS data and evaluates the impact of different data processing methods on model performance (Figure 1). The detailed processes for integrating multi-source RS data in the GEE platform for soil salinity high-precision mapping are outlined as follows:

(1): Preprocess and extract features from RS imagery on the GEE platform.
(2): Conduct correlation analysis and construct modeling sets using different approaches, including (i) all features, (ii) features selected through collinearity analysis, and (iii) features selected by feature selection algorithms.
(3): Apply various regression models and compare validation set accuracy to determine the optimal approach.
(4): Observe the impact of feature numbers on model accuracy using RFE and identify key features with SHAP values.
(5): Validate the stability of the results using multiple models and regional datasets.
(6): Perform soil salinity mapping using the optimal feature set on the GEE platform.

2.1. Data Preprocessing

2.1.1. Soil Samples and Preprocessing

Our study area consists of three parts (Figure 2). Site 1 is located at the Kongtaileke Ranch on the edge of the Taklamakan Desert. This area has a temperate continental arid climate with strong evaporation and low precipitation, commonly causing high salt content in the soil [44]. Annual rainfall averages less than 60 mm, mostly occurring between May and September, while annual evaporation exceeds 2400 mm [45]. The soils are mostly desert and saline soils with low fertility. The natural vegetation is dominated by desert species, such as tamarisk (Tamarix chinensis Lour.) and camel thorn (Alhagi camelorum Fisch.). Cultivated land is concentrated in oasis areas, and the main crop is cotton. The research team collected samples from 14 to 17 August 2021 along two intersecting main roads, chosen with traffic conditions in mind. A total of 186 samples were obtained during this period at a sampling depth of 0–20 cm. Each sample was a mixed sample, and the coordinates of its center point were used as the coordinates of the sample point. The soil samples were air-dried in the laboratory and then used to prepare a saturated slurry with a soil–water ratio of 1:5. Electrical conductivity (EC, dS m⁻¹) was measured at room temperature (25 °C) using the DDS-307A conductometer (Shanghai Inesa Scientific Instrument Co., Ltd., Shanghai, China).

Site 2 sample data were obtained from Almaty Province in southeastern Kazakhstan. Site 1 and site 2 have similar climatic conditions, but site 2 is not as extremely arid as site 1, with annual average precipitation reaching 300 mm. Almaty Province samples were collected in May–July 2022 from three counties (Shelek, Kapchagay, and Alakol), totaling 201 samples. EC data for site 2 were obtained from Mukhamediev et al. [19], where measurements were conducted with a Hanna GroLine HI9814 (Hanna Instruments Inc., Woonsocket, RI, USA) using a 1:5 soil-to-water ratio.

Soil salinity data from site 3 were collected in Shandong Province, China, in 2020, comprising 294 samples obtained through a grid-based sampling strategy. EC was measured with a METTLER TOLEDO SevenCompact™ S230-USP/EP-CN conductometer using a 1:5 soil–water ratio, and the results were converted to soil salinity content (g kg⁻¹) using the corresponding formula. The climatic conditions and salinization processes in this region differ markedly from those in the primary study area. In coastal regions, direct seawater intrusion combined with extensive human activities drives pronounced spatial heterogeneity in soil salinity. Moreover, site 3 is characterized by a warm temperate monsoon climate, with annual precipitation ranging from 556 to 1281 mm, substantially higher than that of sites 1 and 2. The higher rainfall, together with frequent cultivation practices, generally results in lower soil salinity levels at site 3. A detailed description of site 3 can be found in Chi et al. [46]. Compared with site 2 and site 3, site 1 is more arid and exhibits a higher degree of soil salinization, making it more suitable for mechanistic studies of soil salinization. In contrast, site 2 and site 3 are characterized by different climatic conditions and salinization processes, which makes them appropriate for testing the transferability of models and methodologies. The detailed information for samples from each site is shown in Table 1.

2.1.2. Image Processing and Feature Extraction

In this study, Sentinel-1 GRD data (COPERNICUS/S1_GRD) on 15 August 2021, Sentinel-2 data (COPERNICUS/S2_SR_HARMONIZED) on 17 August 2021, and SRTM DEM (USGS/SRTMGL1_003) data were acquired on the GEE platform. We first preprocessed the images, including thermal noise removal, terrain correction, and radiometric calibration of the Sentinel-1 images. After bilinear resampling to 10 m, all the features in Table 2 were extracted from GEE according to Ma et al. [25] and Li et al. [26].

2.2. Feature Optimization Methods

In this study, VIF analysis was performed entirely in IBM SPSS Statistics 26, the ACO and PSO algorithms were implemented in MATLAB 2022b, and the remaining feature selection algorithms, along with subsequent model building and visualization, were carried out in a Python 3.10 environment. Mapping tasks were completed on GEE.

2.2.1. Based on VIF Method

Multicollinearity analysis is essential when performing multiple linear regression. A common approach is to use the variance inflation factor (VIF) to evaluate the linear dependence among RS features. The formula for the VIF is given below:

\mathrm{VIF} = \frac{1}{1 - R^{2}}

(1)

where R² is the coefficient of determination obtained by regressing one RS feature on the remaining RS features.

From the above equation, the larger the R², the higher the VIF, and the stronger the linear relationship between the features. The widely accepted rule of thumb is that a VIF > 10 indicates strong collinearity between the features.
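As an illustration of Equation (1), the following minimal sketch computes the VIF of each feature by regressing it on all other features with ordinary least squares. It is not the SPSS procedure used in this study; the function name vif and the synthetic features a, b, and c are hypothetical and serve only to show that a near-duplicate feature pushes the VIF far above 10.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)  # Equation (1)
    return out

# Synthetic data (assumption): c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
```

In this toy case the collinear pair (a, c) exceeds the VIF > 10 threshold while b stays close to 1.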

2.2.2. Boruta

The Boruta algorithm generates shadow features by randomly shuffling copies of the input features, which turns them into noise, and then compares the importance of the input features with that of the shadow features to determine whether each feature is important. The selection process requires no human intervention, and the resulting importance scores can be used to interpret and visualize model results.
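The shadow-feature comparison can be sketched as follows. This is a deliberately simplified illustration, not the Boruta implementation used in this study: the importance measure is swapped for the absolute Pearson correlation with the target (Boruta itself relies on random forest importance and a statistical test over iterations), and the names pearson, shadow_hits, and the synthetic columns are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def shadow_hits(columns, y, n_iter=20, seed=0):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        shadow_scores = []
        for values in columns.values():
            shuffled = values[:]          # shadow = shuffled copy of a real feature
            rng.shuffle(shuffled)
            shadow_scores.append(abs(pearson(shuffled, y)))
        best_shadow = max(shadow_scores)
        for name, values in columns.items():
            if abs(pearson(values, y)) > best_shadow:
                hits[name] += 1
    return hits

# Synthetic data (assumption): one informative feature and one pure-noise feature.
rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(200)]
columns = {
    "informative": [v + 0.1 * rng.gauss(0, 1) for v in y],
    "noise": [rng.gauss(0, 1) for _ in range(200)],
}
hits = shadow_hits(columns, y)
```

A feature that beats the best shadow in every iteration would be retained, whereas one that rarely does would be rejected.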

2.2.3. RFE

RFE reduces the number of features by iteratively training the learner and gradually eliminating unimportant features. However, RFE requires the target number of features to be preset manually, and it then selects that many features according to the importance ranking.
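The elimination loop can be illustrated with a minimal sketch. As an assumption made for brevity, the learner here is an ordinary least-squares fit on standardized features, with the absolute coefficient as the importance score; the study itself pairs RFE with other learners such as RF. The function name rfe_linear and the synthetic data are hypothetical.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Drop the least important feature one at a time until n_keep remain."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so coefficients are comparable
    yc = y - y.mean()
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        coef, *_ = np.linalg.lstsq(Xs[:, remaining], yc, rcond=None)  # retrain the learner
        worst = remaining[int(np.argmin(np.abs(coef)))]               # least important feature
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated

# Synthetic data (assumption): only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=150)
kept, dropped = rfe_linear(X, y, n_keep=2)
```

Running the loop for a range of n_keep values and recording validation accuracy at each step gives the kind of accuracy-versus-feature-number curve that RFE is used for in this study.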

2.2.4. PSO

Particle swarm optimization (PSO) is inspired by the collective movement rules of bird flocks, and a swarm of particles with distinct positions and velocities is generated to represent different candidate solutions [47]. Initially, the algorithm starts with a set of random solutions; during the iterative process, the velocities and positions of particles are updated by tracking two extremum particles (personal best, P_best; global best, G_best). The update mechanism is formulated in Equations (2) and (3).

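V_{i}(t + 1) = w V_{i}(t) + c_{1} r_{1} (P_{\mathrm{best},i}(t) - X_{i}(t)) + c_{2} r_{2} (G_{\mathrm{best}}(t) - X_{i}(t))

(2)

X_{i}(t + 1) = X_{i}(t) + V_{i}(t + 1)

(3)

where X_{i}(t) and V_{i}(t) are the position and velocity of the i-th particle at iteration t, w is the inertia weight, c_{1} and c_{2} are acceleration coefficients, and r_{1} and r_{2} are random numbers drawn uniformly from [0, 1].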
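A minimal sketch of the update rules in Equations (2) and (3) is given below. It minimizes a simple continuous test function rather than selecting features, and it is not the MATLAB implementation used in this study. The parameter values (w = 0.7, c1 = c2 = 1.5, 20 particles, 200 iterations), the search range, and the objective are all assumptions chosen for illustration.

```python
import random

def pso(objective, dim, n_particles=20, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective with a basic particle swarm."""
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    p_best = [x[:] for x in X]                      # personal best positions
    p_val = [objective(x) for x in X]
    g_idx = min(range(n_particles), key=p_val.__getitem__)
    g_best, g_val = p_best[g_idx][:], p_val[g_idx]  # global best position
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (p_best[i][d] - X[i][d])
                           + c2 * r2 * (g_best[d] - X[i][d]))  # Equation (2)
                X[i][d] += V[i][d]                              # Equation (3)
            val = objective(X[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = X[i][:], val
                if val < g_val:
                    g_best, g_val = X[i][:], val
    return g_best, g_val

# Hypothetical objective: the sphere function, whose minimum is 0 at the origin.
best_x, best_val = pso(lambda x: sum(v * v for v in x), dim=3)
```

For feature selection, the position vector would instead encode which features are included, and the objective would be a model error computed on the selected subset.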

2.2.5. ACO

Ant colony optimization (ACO) is inspired by the behavior of ants that release pheromones along paths and select paths with higher pheromone concentrations during foraging [48]. Initially, multiple “ants” construct feasible solutions in parallel and make stochastic selections based on local pheromones and heuristic information. The selection probability is expressed in Equation (4):

P_{ij}^{k}(t) = \begin{cases} \frac{\tau_{ij}^{\alpha} \eta_{ij}^{\beta}}{\sum_{l} \tau_{il}^{\alpha} \eta_{il}^{\beta}}, & \text{if } l \text{ and } j \text{ are admissible} \\ 0, & \text{otherwise} \end{cases}

(4)

where P_{ij}^{k} denotes the transition probability of the k-th ant from node i to node j, τ_{ij} is the pheromone concentration at node (i, j), and η_{ij} represents the heuristic information at node (i, j). α and β are parameters that regulate the relative importance of pheromones and heuristic information.

Subsequently, ACO evaporates and reinforces pheromones (based on solution quality); multiple iterations lead to the amplification of high-quality solutions and the attenuation of inferior ones, as shown in Equation (5):

\tau_{ij}(t + 1) = (1 - \rho) \tau_{ij}(t) + \sum_{k = 1}^{m} \Delta\tau_{ij}^{k}(t) + \Delta\tau_{ij}^{g}(t)

(5)

where τ_{ij}(t) is the pheromone concentration at time t, ρ ∈ (0, 1] denotes the pheromone evaporation coefficient, m is the size of the ant colony, Δτ_{ij}^{k} is the pheromone concentration deposited by the k-th ant at node (i, j), and Δτ_{ij}^{g} represents the pheromone concentration accumulated by the globally optimal ant g at node (i, j).
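The two ACO rules, the selection probability of Equation (4) and the pheromone update of Equation (5), can be sketched for a single node as follows. The function names, the parameter values (α = 1, β = 2, ρ = 0.1), and the toy pheromone and heuristic values are assumptions for illustration; this is not the MATLAB implementation used in this study.

```python
def transition_probabilities(tau_row, eta_row, admissible, alpha=1.0, beta=2.0):
    """Equation (4): probability of moving to each candidate node."""
    weights = [
        (tau ** alpha) * (eta ** beta) if ok else 0.0   # inadmissible nodes get zero weight
        for tau, eta, ok in zip(tau_row, eta_row, admissible)
    ]
    total = sum(weights)
    if total == 0.0:
        return [0.0 for _ in weights]
    return [w / total for w in weights]

def update_pheromone(tau, ant_deposits, best_deposit, rho=0.1):
    """Equation (5): evaporation, deposits from all ants, extra deposit from the best ant."""
    return (1.0 - rho) * tau + sum(ant_deposits) + best_deposit

# Toy values (assumptions): three candidate nodes, the third one inadmissible.
probs = transition_probabilities(
    tau_row=[1.0, 1.0, 2.0], eta_row=[0.5, 1.0, 1.0], admissible=[True, True, False]
)
new_tau = update_pheromone(tau=1.0, ant_deposits=[0.2, 0.1], best_deposit=0.3)
```

With equal pheromone on the first two nodes, the node with the larger heuristic value receives the higher probability (0.8 versus 0.2 here), and the pheromone on the updated path rises from 1.0 to about 1.5.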

2.3. Model and Model Assessments

2.3.1. Regression

To establish the relationship between soil electrical conductivity (EC) and the predictor variables, we implemented and compared several commonly used regression algorithms, including linear regression (LR), partial least squares regression (PLSR), extreme gradient boosting (XGB) regression, k-nearest neighbor (KNN) regression and random forest (RF). LR and PLSR were employed as baseline linear models, with PLSR suited to handling multicollinearity among spectral indices, while RF and XGB were chosen as representative ensemble tree-based approaches with the ability to capture nonlinear relationships. KNN regression was included as a distance-based non-parametric method. Our experimental approach is as follows: (1) Use a simple LR model to observe the accuracy corresponding to different feature optimization methods. (2) Employ RF-RFE to explain the impact of feature quantity on model accuracy. (3) Assess the robustness of validation results across different models. (4) Compare the soil salinity mapping outcomes of several methods on GEE according to the optimal strategy derived from the above steps. Model development and training were conducted in Python, using the sklearn and xgboost libraries, with empirical hyperparameter optimization. Two-thirds of the dataset was used for calibration, and one-third was reserved for validation. The parameters for each model are shown in Table 3.

2.3.2. Classification

For soil salinity classification, this study first segmented the dataset locally, converting soil EC data into salinization severity grades. Following the methodology of Omuto et al. [49], soil EC data from site 1 were classified into six categories (Table 4). Values ranging from none to extreme were assigned sequential numerical values from 1 to 6, respectively, for salinization level assessment. Here, 1 indicates no salinization, while 6 represents extremely severe salinization. Then, we compared several classification models on GEE, including gradient tree boosting (GTB), classification and regression trees (CART), RF, support vector machine (SVM), KNN, and Naive Bayes. To ensure robust model evaluation, the dataset was randomly partitioned into 70% for calibration and 30% for validation. The parameters for every model are shown in Table 3.

2.3.3. Accuracy Assessment

Model performance was evaluated using different sets of indicators for regression and classification tasks. For regression models, evaluation indicators included the coefficient of determination (R²), root mean square error (RMSE), Lin’s concordance correlation coefficient (LCCC), and the ratio of performance to interquartile range (RPIQ), with the calculation formulas as follows:

R^{2} = \frac{\sum_{i = 1}^{n} (\hat{y}_{i} - \bar{y})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(6)

\mathrm{RMSE} = \sqrt{\frac{\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}}

(7)

\mathrm{LCCC} = \frac{2 r s_{Y_{C}} s_{Y_{V}}}{s_{Y_{C}}^{2} + s_{Y_{V}}^{2} + (\bar{Y}_{C} - \bar{Y}_{V})^{2}}

(8)

\mathrm{RPIQ} = \frac{Q_{3} - Q_{1}}{\mathrm{RMSE}}

(9)

where n stands for the sample number, y_{i} and \hat{y}_{i} are the observed and predicted soil EC values, \bar{y} is the mean of the observed values, and Q_{1} and Q_{3} are the first and third quartiles of the observed values, respectively. r is the Pearson coefficient between the observation and prediction, and s_{Y_{C}} and s_{Y_{V}} are the standard deviations of each dataset, with \bar{Y}_{C} and \bar{Y}_{V} the corresponding means.
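The four regression indicators can be computed with a short sketch such as the one below. Several choices are assumptions: R² follows the explained-to-total sum-of-squares form of Equation (6), which coincides with 1 − SSE/SST only for least-squares fits evaluated on their own calibration data; LCCC uses the standard form with the squared difference of means in the denominator and population standard deviations; and the quartiles use the inclusive method of Python's statistics module. The function name and the toy values are hypothetical.

```python
import math
import statistics

def regression_metrics(y_obs, y_pred):
    """Return R2, RMSE, LCCC and RPIQ for observed and predicted values."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    ss_tot = sum((y - mean_obs) ** 2 for y in y_obs)
    r2 = sum((p - mean_obs) ** 2 for p in y_pred) / ss_tot           # Equation (6)
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(y_pred, y_obs)) / n)  # Equation (7)
    s_obs = math.sqrt(ss_tot / n)
    s_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in y_pred) / n)
    cov = sum((y - mean_obs) * (p - mean_pred) for y, p in zip(y_obs, y_pred)) / n
    r = cov / (s_obs * s_pred)                                        # Pearson coefficient
    lccc = (2 * r * s_obs * s_pred
            / (s_obs ** 2 + s_pred ** 2 + (mean_obs - mean_pred) ** 2))  # Equation (8)
    q1, _, q3 = statistics.quantiles(y_obs, n=4, method="inclusive")
    rpiq = (q3 - q1) / rmse                                           # Equation (9)
    return {"R2": r2, "RMSE": rmse, "LCCC": lccc, "RPIQ": rpiq}

# Toy values (assumption): predictions shrunk toward the mean of the observations.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 2.5, 3.0, 3.5, 4.0]
metrics = regression_metrics(observed, predicted)
```

In this toy case the predictions are perfectly correlated with the observations (r = 1) yet compressed toward the mean, so LCCC (0.8) penalizes the disagreement in scale that the Pearson coefficient alone would miss.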

For classification models, accuracy was assessed using the confusion matrix, from which the overall accuracy and kappa coefficient were derived [37].
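As a brief illustration of how both indicators follow from the confusion matrix, the sketch below computes the overall accuracy as the share of samples on the diagonal and the kappa coefficient as the agreement beyond what the row and column totals would produce by chance. The two-class matrix is a hypothetical example; the study itself uses six salinization grades.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2  # chance agreement
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Hypothetical two-class confusion matrix.
confusion = [
    [20, 5],
    [10, 15],
]
oa, kappa = overall_accuracy_and_kappa(confusion)
```

Here 35 of 50 samples are classified correctly (overall accuracy 0.7), but half of that agreement is expected by chance, giving a kappa of 0.4.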

2.4. Shapley Additive Explanation

SHAP is a game-theoretic approach for interpreting machine learning models that addresses their “black box” nature [50]. We used SHAP values to visualize the decision-making process. In this study, SHAP values were computed for all input RS features and ranked according to their mean absolute contribution. The SHAP summary plot illustrates the relative importance of each feature, and the analyses were conducted in Python, using “explainer”.
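For a linear model whose features are treated as independent, the SHAP value of feature j for a given sample reduces to the coefficient multiplied by the feature's deviation from its mean, and the ranking uses the mean absolute value over all samples. The sketch below illustrates only this special case and does not call the shap package used in this study; the coefficients, data, and function names are hypothetical.

```python
def linear_shap(coefs, X):
    """SHAP values of a linear model under feature independence: b_j * (x_ij - mean_j)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(coefs))]
    return [[b * (row[j] - means[j]) for j, b in enumerate(coefs)] for row in X]

def mean_abs_shap(shap_values):
    """Mean absolute SHAP value per feature, used to rank feature importance."""
    n = len(shap_values)
    return [sum(abs(row[j]) for row in shap_values) / n
            for j in range(len(shap_values[0]))]

# Hypothetical model with two features and three samples.
coefs = [2.0, -0.5]
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
phi = linear_shap(coefs, X)
ranking = mean_abs_shap(phi)
```

The second feature has the smaller coefficient but the larger mean absolute SHAP value because it varies over a wider range, which shows that the ranking depends on feature scale as well as on the coefficient. For each sample, the SHAP values sum to the difference between its prediction and the mean prediction.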

3. Results

3.1. Descriptive Statistics of Soil Samples

The statistical characteristics of the soil salinity data used for analysis are shown in Figure 3. At the Kongtaileke Ranch (site 1), soil EC ranges from 0.08 to 44.35 dS m⁻¹ with an average of 11.78 dS m⁻¹, indicating an extremely high degree of salinization. In contrast, the soils in Almaty Oblast, Kazakhstan (site 2), have an EC range of 0.04–13.34 dS m⁻¹ with a mean of 0.97 dS m⁻¹, reflecting relatively low salinization levels. Kongtaileke Ranch is in a remote desert Gobi region with minimal human disturbance, whereas most sampling sites in Almaty Oblast are distributed in croplands, leading to overall lower salinity. The distribution of sampling points shows that soil salinization is the lowest in northern coastal Shandong, China (site 3); like the Almaty region, this densely populated landscape is dominated by croplands and forests, with an average salt content of only 1.41 g kg⁻¹. Salinization levels differ markedly across the three regions. Site 1 stands out, with a median value of 11.01 dS m⁻¹ (close to the highest value recorded at site 2) and a standard deviation of 10.64 dS m⁻¹, which is markedly higher than that of the other regions. Data exist for all six grades at site 1, with the extreme grade containing 70 data points, followed by the none grade, which contains 32 sample points. Machine learning models commonly underestimate high values and overestimate low values. Site 1 contains more low- and high-value EC data points, which helps improve the model’s fitting ability in extreme-value regions.

3.2. Correlation Analysis and Feature Selection

The correlation of individual RS features with EC and internal correlation within the features are shown in Figure 4a. Overall, the extracted multispectral features correlate more strongly with EC, with absolute correlation coefficients generally above 0.5, while the SAR features are relatively weaker, the largest absolute correlation coefficient being 0.28 for VH. However, as shown in the annotated part of Figure 4, using an absolute correlation coefficient threshold of 0.9, there is strong multicollinearity among RS features, for both multispectral and SAR features. This phenomenon is even more pronounced when thresholds of 0.8 or even 0.7 are employed.
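The threshold-based screening described above can be sketched as a search for feature pairs whose absolute Pearson correlation exceeds a chosen threshold. The feature names f1 to f3 and their values are hypothetical placeholders, not the RS features of this study.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def collinear_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs

# Hypothetical features: f2 is nearly a scaled copy of f1, f3 is unrelated.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = collinear_pairs(features)
```

Lowering the threshold to 0.8 or 0.7 flags more pairs, which is why the collinearity appears more pronounced at those thresholds.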

To address this, this study selected four feature optimization methods (VIF, ACO, PSO, and Boruta) and examined the correlations between the features retained by each method (Figure 4). As illustrated in the figure, VIF effectively reduces correlations by eliminating collinear features: correlations between the retained features are generally lower, and all multicollinear features are eliminated. Among the seven features retained by Boruta, the correlation coefficient between BI and MSAVI2 is 0.94, so collinearity still exists. Likewise, although PSO and ACO significantly reduce the number of features with high multicollinearity, the absolute correlation coefficients between a few features remain above 0.9.

According to the analysis of correlation, multicollinearity is substantially reduced, and the retained features exhibit strong correlations with EC. Nonetheless, some high correlation features may also be eliminated, reflecting methodological differences between feature optimization approaches. In theory, reducing redundant features enables models to assign greater weight to the most important predictors, thereby enhancing predictive accuracy. However, eliminating strongly correlated features may also compromise model performance, and thus the net effect must be assessed from the modeling perspective.

3.3. Effect of Feature Selection on Model Accuracy

Initially, the LR model was applied to examine how different feature optimization approaches influence prediction accuracy (Figure 5). For regional datasets, the absence of feature selection leads to large differences in the calibration and validation results of the model (the calibration set R² is 0.61, but the validation set R² is only 0.42; Figure 5a). Although the VIF eliminates feature collinearity, it does not improve prediction accuracy and in fact reduces performance compared to using all features (R² decreased from 0.42 to 0.33). By contrast, Boruta improves prediction accuracy while substantially reducing feature collinearity (R² improves from 0.42 to 0.47 and RMSE decreases by 1.03 dS m⁻¹). Notably, the LR model constructed using features selected by PSO and ACO demonstrates high predictive accuracy, with R² reaching 0.53, an RMSE of 7.69 dS m⁻¹, and an RPIQ of 2.27.

SHAP values explain the impact of feature selection in this process (Figure 5). When constructing the LR model with all features, the SHAP values of P6-P8 are the highest, while those of the remaining features are completely masked (all close to 0), indicating that P6, P7, and P8 entirely dominate the EC prediction process using LR. However, the maximum correlation coefficient between P6-P8 and EC is only 0.26, reflecting a low correlation; thus, the most critical information is overshadowed by these secondary features, resulting in slightly lower accuracy. In contrast, feature optimization eliminates the dominant influence of individual features on the results, and elevation, salinity indices, polarimetric combination indices, and other features all play a certain role in the prediction process. For instance, among the features selected by the VIF, SI4, elevation, and P7 are the three most important for EC prediction; however, except for SI4 (with an absolute correlation coefficient of 0.51), the other two features have low correlations with EC (0.39 and 0.26, respectively), leading to unsatisfactory accuracy of the LR model. In contrast, the results of Boruta, ACO, and PSO all retain features with high correlations (e.g., MSAVI2, NDVI, and BI), causing the validation sets’ accuracy to be higher than that of the full dataset and VIF method.

Meanwhile, RFE was performed to examine the effect of feature dimensionality (Figure 6). Results show that RFE consistently improved prediction accuracy relative to no selection (all R² > 0.37), and the validation set R² improved by at most 0.14. Moreover, Figure 6 also shows that the difference between the calibration and validation results gradually increases as the number of features increases. This reveals that the model increasingly overfits as features with multicollinearity are added. For RFE, the optimal number of features is nine, at which the calibration and validation results are closest (the R² values for the calibration and validation sets are 0.52 and 0.51, respectively). Among the features, CRSI, BI, and MSAVI2 contribute most significantly to salinity prediction, with mean absolute SHAP values all exceeding 1.5. Additionally, elevation emerges as a crucial factor influencing the RF model’s decision-making process, with an average absolute SHAP value surpassing 0.6, explaining the trend of lower salinity at higher elevations within the study area.

In addition, to verify the robustness of the results across different models, we also used PLSR, XGB, and KNN for testing (Figure 7). Given the limited applicability of ACO and PSO, we only present the comparative results of the VIF, Boruta, and RFE. The results show that the different processing methods have slightly different effects depending on the models but consistently show worse results for the VIF and better results for Boruta and RFE. Compared to full-dataset modeling, VIF filtering reduced the R² of the validation sets by 0.01–0.22 and increased the RMSE by 0.06–1.11 dS m⁻¹. In contrast, compared to full-dataset modeling, RFE maximally increased the validation set R² by 0.28 and decreased the RMSE by 2.12 dS m⁻¹ and the LCCC improved by 0.20. Without feature selection, PLSR outperformed the other two machine learning models with a prediction set R² of 0.49, while after selection, KNN surpassed both PLSR and XGB, achieving a validation set R² of 0.66. Across all models, Boruta and RFE consistently led to higher predictive accuracy, followed by no feature selection, with the VIF producing the lowest performance. However, in the KNN results, both Boruta and RFE showed underfitting accuracy, whereas RF achieved a validation set R² of 0.51 when using nine features, yielding superior overall performance. Thus, the RFE-RF method is a more suitable evaluation approach.

Overall, feature selection achieves higher prediction accuracy by removing most of the redundant features, highlighting its importance for regional datasets, not only for minimizing collinearity but also for improving the accuracy of model evaluation. While the VIF effectively reduces redundancy, its reliance solely on statistical relationships between features limits its ability to preserve variables most relevant to the target, thereby constraining model accuracy. In addition, RFE achieved an R² of 0.51 on the validation set with just nine features, attaining a performance comparable to ACO and PSO.

3.4. Qualitative and Quantitative Assessment of Soil Salinity

Figure 8a shows the global Moran’s I values for the nine RS features derived from RFE, calculated in ArcGIS 10.8.2 to quantify spatial autocorrelation. The results show that, excluding IPVI, all variables exhibit strong positive spatial clustering, with values ranging from 0.291 (SR) to 0.567 (DEM). Specifically, soil EC (0.625) and DEM (0.567) have the highest Moran’s I values, indicating significant spatial clustering, whereas the autocorrelation coefficient for IPVI (0.0261) is near zero, indicating that its spatial distribution is nearly random. Based on the above indicators, six classification algorithms implemented on the GEE platform were used to classify soil salinization levels (Figure 8b). The six algorithms differ significantly, with tree-based algorithms proving the most reliable. GTB, CART, and RF all have high classification accuracy, with the overall accuracy ranging from 0.68 to 0.72 and the kappa coefficient from 0.57 to 0.65. SVM performed comparably to CART, with an overall accuracy of 0.69 and a kappa coefficient of 0.59. KNN followed, with markedly lower accuracy than the four leading algorithms (overall accuracy of only 0.59 and a kappa coefficient of 0.48). The probability-based Naive Bayes algorithm performed the worst for soil salinization level assessment, with an overall accuracy below 0.50.
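The global Moran’s I statistic reported above was computed in ArcGIS. As an illustration of the quantity itself, the sketch below evaluates it on a small raster. The rook-contiguity binary weights and the toy grids are assumptions made for the example and do not reproduce the spatial weighting used in the study.

```python
def morans_i(grid):
    """Global Moran's I on a 2-D grid with binary rook-contiguity weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = 0.0
    weight_sum = 0
    for r in range(rows):
        for c in range(cols):
            # Visit the four edge-sharing neighbours of each cell.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)


# Hypothetical rasters: one spatially clustered, one perfectly alternating.
clustered = [[1, 1, 0, 0]] * 4
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(morans_i(clustered), morans_i(checkerboard))
```

Values near +1 indicate clustering of similar values, values near 0 indicate a nearly random pattern (as reported for IPVI), and negative values indicate dispersion.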

The salinization levels derived qualitatively by the different algorithms on GEE are shown in Figure 9. Except for Naive Bayes, the algorithms produce spatially similar patterns of soil salinization levels. Specifically, the soil salinization level of the region is clearly affected by terrain and human activities, being severe in the middle and slight in the north and south. The middle part of our study area is almost undisturbed Gobi, affected by the regional climate, where surface-accumulated salts are easily visible on the topsoil. This is the area worst hit by soil salinization within the oasis. Notably, among the six classification algorithms we used, RF performs best not only in classification accuracy but also in the portrayal of spatial detail. Several authors have also recommended RF modeling in their research [37,51,52]. Thus, we also utilized RF to invert the regional soil EC.
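The overall accuracy and kappa coefficient used to compare the classifiers can both be derived from a confusion matrix. The sketch below assumes that kappa refers to Cohen's kappa; the two-class matrix is hypothetical and serves only to make the arithmetic visible.

```python
def overall_accuracy_and_kappa(cm):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical confusion matrix; not results from this study.
toy_cm = [
    [40, 10],
    [20, 30],
]

oa, kappa = overall_accuracy_and_kappa(toy_cm)
print(f"overall accuracy = {oa:.2f}, kappa = {kappa:.2f}")
```

Kappa discounts the agreement expected by chance, so it is always lower than overall accuracy for an imperfect classifier, as is also the case for every algorithm in Figure 8b.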

Soil EC in the study area was quantitatively retrieved using the RF algorithm (Figure 10). The results are consistent with the qualitative patterns shown in Figure 9, displaying a spatial trend of “low in the north and south, and high in the center”, influenced by elevation and human activities. Specifically, the central desert Gobi region exhibits considerably higher salinity, with soil EC values generally exceeding 20 dS m⁻¹, while mountainous and farmland areas show much lower levels, with most pixels below 8 dS m⁻¹. Similarly, intermountain depressions are characterized by relatively high salinity, with most pixel values above 14 dS m⁻¹. However, discrepancies are observed in the central salt-tolerant vegetation region: field measurements are mostly above 12 dS m⁻¹, whereas the model assigns these areas to the strong level (4–8 dS m⁻¹). This mismatch is likely due to limited field sampling in inaccessible areas, resulting in insufficient training samples. Overall, predicted soil salinity at site 1 ranges from 2.72 to 35.94 dS m⁻¹, a narrower range than the observed values (0.08–44.35 dS m⁻¹) yet within a reliable range.
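The mismatch described above is easier to see when EC values are mapped to the salinization levels of Table 4. The sketch below encodes those thresholds. The handling of values that fall exactly on a class boundary (assigned to the higher class) is an assumption, because Table 4 does not state it, and the function name `salinity_level` is hypothetical.

```python
from bisect import bisect_right

# Upper class limits in dS m^-1, taken from Table 4.
UPPER_BOUNDS = [0.75, 2, 4, 8, 15]
LEVELS = ["None", "Slight", "Moderate", "Strong", "Very Strong", "Extreme"]


def salinity_level(ec):
    """Map a soil EC value (dS m^-1) to its Table 4 salinization level.

    Assumption: a value exactly on a boundary goes to the higher class.
    """
    return LEVELS[bisect_right(UPPER_BOUNDS, ec)]


# Field measurements above 12 dS m^-1 versus predictions within 4-8 dS m^-1.
print(salinity_level(12.5))  # class of a typical field measurement
print(salinity_level(6.0))   # class of a typical model prediction
```

Under these thresholds a field value above 12 dS m⁻¹ falls in the very strong class, one full level above the strong class predicted by the model.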

Figure 10b shows the spatial distribution of uncertainty in soil EC predictions, with standard deviation (SD) values ranging from 0.004 to 7.033 dS m⁻¹. Areas of high uncertainty (red regions) coincide with areas of high soil salinization, particularly in the southwestern regions covered by salt-tolerant vegetation. These areas are persistently covered by salt-tolerant green vegetation throughout the year; however, this vegetation obscures soil information, resulting in a significant discrepancy between the predicted low salinity levels and the reality of extremely high soil salinization. Furthermore, this reflects that the RF model exhibits greater variability—and consequently higher error (larger RMSE)—when applied to areas with sparse training data or complex soil–topography interactions. In contrast, the southern cultivated regions covered by crops generally exhibit low uncertainty (green areas), where stable environmental conditions and dense sampling result in more stable predictions. Similarly, the northern mountainous regions generally exhibit lower levels of salinization, resulting in consistently lower prediction errors. The coexistence of high salinity and high uncertainty in the southwestern region indicates that model performance is highly sensitive to the strong spatial clustering of salt-affected soils, underscoring the need for targeted sampling in these critical areas to reduce prediction variability.
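This section does not restate how the per-pixel SD was obtained. One common way to derive such an uncertainty layer from an RF model, assumed here purely for illustration, is to take the standard deviation of the individual tree predictions at each pixel. The tree outputs below are hypothetical.

```python
from statistics import mean, pstdev


def pixel_mean_and_sd(tree_predictions):
    """Return the ensemble mean and spread (population SD) for one pixel.

    tree_predictions: EC predictions (dS m^-1) from the individual trees.
    Assumption: the uncertainty is the spread across trees.
    """
    return mean(tree_predictions), pstdev(tree_predictions)


# Hypothetical per-tree predictions for two pixels; not study data.
stable_pixel = [5.0, 5.2, 4.8, 5.0]
uncertain_pixel = [10.0, 12.0, 14.0]

print(pixel_mean_and_sd(stable_pixel))
print(pixel_mean_and_sd(uncertain_pixel))
```

Pixels where the trees disagree, such as those with sparse training data, receive a larger SD, in line with the pattern described for the salt-tolerant vegetation areas.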

3.5. Validation of Methodology

We further conducted transfer validation using soil EC (dS m⁻¹) data from site 2 and soil salinity (g kg⁻¹) data from site 3 (Figure 11). The modeling results for both sites under different approaches are presented in Table 5. Site 2 utilized the dataset provided by Mukhamediev et al. [19], whereas site 3 employed median composite images from May 2018–2020 to calculate the variables listed in Table 2. The results reveal that for soil EC data in southeastern Kazakhstan, model accuracy varied substantially with the number of features. The highest accuracy was achieved when the number of features was around 100, with the prediction set R² increasing by 38% (0.24 to 0.33) and RMSE decreasing by 6% (0.80 dS m⁻¹ to 0.75 dS m⁻¹) compared with the full dataset. Relative to VIF-selected features, the prediction set R² improved by 0.18, while RMSE decreased by 0.1 dS m⁻¹. For soil salinity data in eastern coastal China, model performance was likewise strongly influenced by feature number. The best results were obtained with approximately 20 features, yielding an increase of 0.05 in the prediction set R² and a reduction of 0.09 g kg⁻¹ in RMSE compared with the full dataset. Compared with VIF-selected features, the prediction set R² improved by 0.2 and RMSE decreased by 0.29 g kg⁻¹. SHAP summary plots indicate that soil salinity predictions are influenced by many factors and that the dominant factors differ between sites. In the EC assessment at site 2, amma_vv_2, green_3, and blue_5 are the most important features, with original band data contributing significantly to salinity prediction, whereas for soil salinity data at site 3, vegetation indices are the primary factors, far surpassing single bands. Among these, ENDVI, NDWI, GDVI, DVI, and SI significantly influence the model.
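The relative changes quoted for site 2 can be checked directly against the validation columns of Table 5. The short sketch below does this arithmetic; only the helper name `percent_change` is introduced here.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Site 2 validation values from Table 5 (full data versus RFE).
r2_full, r2_rfe = 0.237, 0.328
rmse_full, rmse_rfe = 0.800, 0.751  # dS m^-1

r2_gain = percent_change(r2_full, r2_rfe)
rmse_drop = percent_change(rmse_full, rmse_rfe)

print(f"R2 change:   {r2_gain:.1f}%")
print(f"RMSE change: {rmse_drop:.1f}%")
```

The same helper applied to the VIF rows gives the contrast with VIF-selected features, provided that the RMSE values are kept in the unit of the respective site.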

4. Discussion

4.1. The Significance of Multi-Source Integration and Feature Optimization

With the increasing availability of RS data sources and the extended operational lifespan of sensors, we have now entered the era of big data in RS [53]. Big data-driven RS monitoring has become a prominent research frontier. However, despite the complexity, diversity, and high heterogeneity of RS big data—together with its potential value—we still lack effective strategies for its full utilization [53], which poses a critical bottleneck in accurately capturing the spatiotemporal dynamics of soil salinization, a process inherently governed by the interaction of soil, vegetation, and environmental factors [54]. From the perspective of soil theory, soil salinity is essentially driven by the imbalance between salt accumulation and leaching in the soil profile: factors such as groundwater level, evaporation intensity, and vegetation cover directly regulate salt migration and distribution, while soil texture and organic matter content further modify the retention and movement of salt ions. In this context, to obtain results that better reflect real-world conditions, integrating multi-source RS data (encompassing soil, vegetation, and environmental variables) is not merely a technical choice but a necessary approach to align with the multi-factorial nature of salinization mechanisms. Such integration not only enhances model robustness but also improves the interpretability of the spatiotemporal distribution of soil properties—linking the observed RS signals to the underlying salinization processes [17,52].

Even single-temporal observations can generate hundreds of candidate features, but the inherent redundancy and collinearity among these features are not trivial; they stem from the intrinsic correlations between the ecological and physical processes underlying salinization. For instance, the strong collinearity observed between P4 and P1 (correlation coefficient = −0.99), NDVI and NDSI (correlation coefficient = −1), and SI5 and SR (correlation coefficient = 0.95) is not a random statistical phenomenon but a reflection of overlapping information in RS features that target similar salinization-related processes. NDVI and NDSI, for example, both respond to vegetation cover and canopy water content, which are closely tied to salt stress; thus, their inverse correlation arises from the opposing responses of vegetation vitality to salinization (NDVI decreases with increasing salinization, while NDSI increases). This collinearity not only leads to computational inefficiency but also obscures the true relationships between RS features and soil salinization, potentially introducing biases into the model that contradict salinization theory (e.g., overemphasizing redundant features that do not contribute to capturing salt migration mechanisms). Therefore, some researchers have recently employed correlation cluster analysis to summarize the similarity of information among RS features and categorized 57 remote sensing features into seven clusters while applying correlation threshold filtering to each cluster [51]. Our results further clarify the mechanism by which feature optimization addresses these challenges: both VIF and feature selection algorithms mitigate multicollinearity issues by eliminating redundant features, thereby retaining the features that directly or indirectly link to salt accumulation or vegetation response mechanisms. 
Importantly, feature subsets derived from feature selection algorithms yield superior predictive performance while reducing the number of features from 33 to 9. This optimization not only enhances computational efficiency but also helps the model focus on features linked to salinization mechanisms rather than on spurious correlations, making the prediction results more consistent with the actual spatiotemporal dynamics of soil salinization. Therefore, to meet the demands for high-precision and visualizable remote sensing monitoring of regional soil properties, the approach we employ, which combines feature selection with machine learning and SHAP analysis, represents a practical solution.
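The collinearity examples discussed above can be made concrete with a short numerical check. With the Table 2 definitions, NDSI is algebraically the negative of NDVI, so their correlation is exactly −1 for any input. For a pair of features, the VIF reduces to 1/(1 − r²), which is a standard identity rather than a result of this study. The reflectance values below are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ndvi(b8, b4):
    return (b8 - b4) / (b8 + b4)  # Table 2 definition


def ndsi(b8, b4):
    return (b4 - b8) / (b4 + b8)  # Table 2 definition, equal to -NDVI


def pairwise_vif(r):
    """VIF when a feature is regressed on a single other feature (|r| < 1)."""
    return 1.0 / (1.0 - r ** 2)


# Hypothetical Sentinel-2 reflectances (B8 = NIR, B4 = red); not study data.
b8 = [0.30, 0.42, 0.25, 0.51, 0.38]
b4 = [0.10, 0.08, 0.20, 0.05, 0.15]

ndvi_values = [ndvi(n, r) for n, r in zip(b8, b4)]
ndsi_values = [ndsi(n, r) for n, r in zip(b8, b4)]

print(pearson(ndvi_values, ndsi_values))  # perfectly inverse relationship
print(pairwise_vif(-0.99))                # P4 versus P1 correlation from the text
```

A correlation of −0.99 already corresponds to a pairwise VIF of about 50, far above the thresholds commonly used for VIF screening, while a correlation of exactly ±1 makes the VIF undefined, which is why one feature of such a pair carries no additional information.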

4.2. Limitations of the Study and Future Directions

Bouasria et al. [39] argued that the reliability of machine learning predictions based on ground survey data depends on factors such as sample size, sampling design, and spatial extent. Similarly, Khaire and Dhanalakshmi [40] emphasized that the robustness of feature selection is influenced by multiple factors, including the size of the sample set and the variance of the data. Both studies highlight that the performance of feature selection and machine learning models is affected by the heterogeneity of the sample space. Although we employed multiple models to verify the consistency of our results, these findings underscore the importance of accounting for sample space heterogeneity when evaluating model performance and feature optimization outcomes. To address this issue, we validated our approach using two additional regional-scale datasets that differ significantly from our study area and obtained consistent results. This further supports the role of feature selection in regional-scale soil salinity assessment.

However, existing research shows that the spatial scale of the study area significantly impacts the accuracy of machine learning models [39]. Because the primary objective of this study was to develop a high-precision, regional-scale online mapping strategy to support subsequent software development and applications, we did not extend the discussion to the national or global scales. Nonetheless, our preliminary experiments suggest that the process may behave differently at those broader scales. The issue of sample scale is also a key factor contributing to the current lack of a unified technical process for soil salinization research. Furthermore, the cross-regional validation covers only two typical regions (Almaty, Kazakhstan, and Shandong, China), and the study focuses on topsoil (0–20 cm) salinity. Further research is needed to establish the strategy’s universality under diverse climatic, soil, and vegetation conditions worldwide.

5. Conclusions

This study integrates multi-source satellite data with feature selection to develop and evaluate an efficient, interpretable framework for online soil salinity mapping. The main conclusions are as follows:

(1): Feature selection is critical for improving regional-scale soil salinity inversion: recursive feature elimination (RFE) significantly enhances model accuracy, while the variance inflation factor (VIF) approach, despite mitigating multicollinearity, reduces accuracy notably. Cross-site validation at Almaty (Kazakhstan) and Shandong (China) confirms the framework’s robustness and cross-regional transferability.
(2): Random forest (RF) outperforms other algorithms in salinity mapping, providing reliable accuracy and high-resolution spatial details. RFE- and SHAP-based analysis identifies CRSI, BI, MSAVI2, and elevation as core predictors, revealing the associations between salinization mechanisms, human cultivation improvement, and topography.

Overall, this study advances the understanding of multi-source remote sensing integration for soil mapping, clarifies feature selection’s role in model performance, and deepens insights into salinization drivers via key predictors, providing valuable methodological and technical support for exploring regional-scale soil salinization mapping. Future research should expand the study area to enhance the framework’s applicability.

Author Contributions

Conceptualization, Z.B. and J.P.; methodology, J.P.; software, S.S.; validation, Y.W.; formal analysis, S.S.; investigation, S.S., Y.W., J.W., J.Y. and Z.B.; resources, J.P.; data curation, S.S. and Y.W.; writing—original draft preparation, S.S.; writing—review and editing, Z.B. and J.P.; visualization, S.S.; supervision, J.P.; project administration, J.P.; funding acquisition, S.S., J.W. and J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by grants from the National Science Foundation of China (Grant No. 42261016, J.P.), Natural Science Support Program of XPCC (Grant No. 2025DA002, J.P.), Tarim University President’s Fund (TDZKSS202404, Z.B.), and the Doctoral Student Research and Innovation Program of Tarim University (Grant Nos. TDBSCX202415, J.W. and TDBSCX202501, S.S.).

Data Availability Statement

The data are available at https://github.com/nxysss0827/Paper.git (Accessed on 20 March 2026).

Acknowledgments

The authors thank the anonymous reviewers and academic editors for their comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hassani, A.; Azapagic, A.; Shokri, N. Global Predictions of Primary Soil Salinization under Changing Climate in the 21st Century. Nat. Commun. 2021, 12, 6663.
2. Harper, R.J.; Dell, B.; Ruprecht, J.K.; Sochacki, S.J.; Smettem, K.R.J. Salinity and the Reclamation of Salinized Lands. In Soils and Landscape Restoration; Academic Press: Cambridge, MA, USA, 2021; pp. 193–208.
3. Wuyun, D.; Bao, J.; Crusiol, L.G.T.; Wulan, T.; Sun, L.; Wu, S.; Xin, Q.; Sun, Z.; Chen, R.; Peng, J.; et al. Generating Salt-Affected Irrigated Cropland Map in an Arid and Semi-Arid Region Using Multi-Sensor Remote Sensing Data. Remote Sens. 2022, 14, 6010.
4. Peng, J.; Ji, W.; Ma, Z.; Li, S.; Chen, S.; Zhou, L.; Shi, Z. Predicting Total Dissolved Salts and Soluble Ion Concentrations in Agricultural Soils Using Portable Visible Near-Infrared and Mid-Infrared Spectrometers. Biosyst. Eng. 2016, 152, 94–103.
5. Garcia, C.; Hernandez, T. Influence of Salinity on the Biological and Biochemical Activity of a Calciorthird Soil. Plant Soil 1996, 178, 255–263.
6. Rhoades, J.; Kandish, A.; Mashali, A. The Use of Saline Waters for Crop Production; FAO: Rome, Italy, 1992.
7. Keesstra, S.D.; Geissen, V.; Mosse, K.; Piiranen, S.; Scudiero, E.; Leistra, M.; van Schaik, L. Soil as a Filter for Groundwater Quality. Curr. Opin. Environ. Sustain. 2012, 4, 507–516.
8. Decock, C.; Lee, J.; Necpalova, M.; Pereira, E.I.P.; Tendall, D.M.; Six, J. Mitigating N₂O Emissions from Soil: From Patching Leaks to Transformative Action. Soil 2015, 1, 687–694.
9. Allbed, A.; Kumar, L. Soil Salinity Mapping and Monitoring in Arid and Semi-Arid Regions Using Remote Sensing Technology: A Review. Adv. Remote Sens. 2013, 2, 373–385.
10. Aquastat FAO. FAO’s Global Information System on Water and Agriculture; FAO, Food and Agriculture Organization of the United Nations: Rome, Italy, 2011.
11. Ghassemi, F.; Jakeman, A.; Nix, H. Salinisation of Land and Water Resources: Human Causes, Extent, Management and Case Studies; CAB International: Wallingford, UK, 1995.
12. FAO. Global Status of Salt-Affected Soils – Main Report; FAO: Rome, Italy, 2024.
13. Shao, H.; Chu, L.; Lu, H.; Qi, W.; Chen, X.; Liu, J.; Kuang, S.; Tang, B.; Wong, V. Towards Sustainable Agriculture for the Salt-Affected Soil. Land Degrad. Dev. 2019, 30, 574–579.
14. Wang, J.; Ding, J.; Yu, D.; Teng, D.; He, B.; Chen, X.; Ge, X.; Zhang, Z.; Wang, Y.; Yang, X.; et al. Machine Learning-Based Detection of Soil Salinity in an Arid Desert Region, Northwest China: A Comparison between Landsat-8 OLI and Sentinel-2 MSI. Sci. Total Environ. 2020, 707, 136092.
15. Jiang, X.; Ma, Y.; Li, G.; Huang, W.; Zhao, H.; Cao, G.; Wang, A. Spatial Distribution Characteristics of Soil Salt Ions in Tumushuke City, Xinjiang. Sustainability 2022, 14, 16486.
16. Wang, F.; Yang, S.; Wei, Y.; Shi, Q.; Ding, J. Characterizing Soil Salinity at Multiple Depth Using Electromagnetic Induction and Remote Sensing Data with Random Forests: A Case Study in Tarim River Basin of Southern Xinjiang, China. Sci. Total Environ. 2021, 754, 142030.
17. Bai, J.; Wang, N.; Hu, B.; Feng, C.; Wang, Y.; Peng, J.; Shi, Z. Integrating Multisource Information to Delineate Oasis Farmland Salinity Management Zones in Southern Xinjiang, China. Agric. Water Manag. 2023, 289, 108559.
18. Sahbeni, G.; Ngabire, M.; Musyimi, P.K.; Székely, B. Challenges and Opportunities in Remote Sensing for Soil Salinization Mapping and Monitoring: A Review. Remote Sens. 2023, 15, 2540.
19. Mukhamediev, R.I.; Merembayev, T.; Kuchin, Y.; Malakhov, D.; Zaitseva, E.; Levashenko, V.; Popova, Y.; Symagulov, A.; Sagatdinova, G.; Amirgaliyev, Y. Soil Salinity Estimation for South Kazakhstan Based on SAR Sentinel-1 and Landsat-8,9 OLI Data with Machine Learning Models. Remote Sens. 2023, 15, 4269.
20. Allbed, A.; Kumar, L.; Sinha, P. Mapping and Modelling Spatial Variation in Soil Salinity in the Al Hassa Oasis Based on Remote Sensing Indicators and Regression Techniques. Remote Sens. 2014, 6, 1137–1157.
21. Hoa, P.V.; Giang, N.V.; Binh, N.A.; Hai, L.V.H.; Pham, T.D.; Hasanlou, M.; Bui, D.T. Soil Salinity Mapping Using SAR Sentinel-1 Data and Advanced Machine Learning Algorithms: A Case Study at Ben Tre Province of the Mekong River Delta. Remote Sens. 2019, 11, 128.
22. Tan, W.; Wang, X.; Yan, L.; Yi, J.; Xia, T.; Zeng, Z.; Yu, G.; Chai, M.; Velpuri, N.M.; Thaneerat, A. Mapping Rice-Crayfish Co-Culture (RCC) Fields with Sentinel-1 and -2 Time Series in China’s Primary Crayfish Production Region Jianghan Plain. Sci. Remote Sens. 2024, 10, 100151.
23. Avdan, U.; Kaplan, G.; Küçük Matcı, D.; Yiğit Avdan, Z.; Erdem, F.; Tuğba Mızık, E.; Demirtaş, İ. Soil Salinity Prediction Models Constructed by Different Remote Sensors. Phys. Chem. Earth Parts A/B/C 2022, 128, 103230.
24. Orynbaikyzy, A.; Plank, S.; Vetrita, Y.; Martinis, S.; Santoso, I.; Dwi Ismanto, R.; Chusnayah, F.; Tjahjaningsih, A.; Suwarsono; Genzano, N.; et al. Joint Use of Sentinel-2 and Sentinel-1 Data for Rapid Mapping of Volcanic Eruption Deposits in Southeast Asia. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103166.
25. Ma, G.; Ding, J.; Han, L.; Zhang, Z.; Ran, S. Digital Mapping of Soil Salinization Based on Sentinel-1 and Sentinel-2 Data Combined with Machine Learning Algorithms. Reg. Sustain. 2021, 2, 177–188.
26. Li, J.; Zhang, T.; Shao, Y.; Ju, Z. Comparing Machine Learning Algorithms for Soil Salinity Mapping Using Topographic Factors and Sentinel-1/2 Data: A Case Study in the Yellow River Delta of China. Remote Sens. 2023, 15, 2332.
27. He, Y.; Zhang, Z.; Xiang, R.; Ding, B.; Du, R.; Yin, H.; Chen, Y.; Ba, Y. Monitoring Salinity in Bare Soil Based on Sentinel-1/2 Image Fusion and Machine Learning. Infrared Phys. Technol. 2023, 131, 104656.
28. Taghadosi, M.M.; Hasanlou, M.; Eftekhari, K. Soil Salinity Mapping Using Dual-Polarized SAR Sentinel-1 Imagery. Int. J. Remote Sens. 2019, 40, 237–252.
29. Pratama, B.A.S.; Danoedoro, P.; Arjasakusuma, S. Exploring Optimal Integration Schemes for Sentinel-1 SAR and Sentinel-2 Multispectral Data in Land Cover Mapping across Different Atmospheric Conditions. Remote Sens. Appl. Soc. Environ. 2024, 34, 101185.
30. Das, K.; Twarakavi, N.; Khiripet, N.; Chattanrassamee, P.; Kijkullert, C. A Machine Learning Framework for Mapping Soil Nutrients with Multi-Source Data Fusion. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS 2021, Brussels, Belgium, 11–16 July 2021; pp. 3705–3708.
31. Zhang, X.; Xue, J.; Chen, S.; Zhuo, Z.; Wang, Z.; Chen, X.; Xiao, Y.; Shi, Z. Improving Model Performance in Mapping Cropland Soil Organic Matter Using Time-Series Remote Sensing Data. J. Integr. Agric. 2024, 23, 2820–2841.
32. Yuzugullu, O.; Fajraoui, N.; Liebisch, F. Soil Texture and PH Mapping Using Remote Sensing and Support Sampling. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12685–12705.
33. Maynard, J.J.; Levi, M.R. Hyper-Temporal Remote Sensing for Digital Soil Mapping: Characterizing Soil-Vegetation Response to Climatic Variability. Geoderma 2017, 285, 94–109.
34. Pelletier, N.; Millard, K.; Darling, S. Wildfire Likelihood in Canadian Treed Peatlands Based on Remote-Sensing Time-Series of Surface Conditions. Remote Sens. Environ. 2023, 296, 113747.
35. Guo, B.; Yang, X.; Yang, M.; Sun, D.; Zhu, W.; Zhu, D.; Wang, J. Mapping Soil Salinity Using a Combination of Vegetation Index Time Series and Single-Temporal Remote Sensing Images in the Yellow River Delta, China. CATENA 2023, 231, 107313.
36. Yuzugullu, O.; Fajraoui, N.; Don, A.; Liebisch, F. Satellite-Based Soil Organic Carbon Mapping on European Soils Using Available Datasets and Support Sampling. Sci. Remote Sens. 2024, 9, 100118.
37. Wang, N.; Chen, S.; Huang, J.; Frappart, F.; Taghizadeh, R.; Zhang, X.; Wigneron, J.P.; Xue, J.; Xiao, Y.; Peng, J.; et al. Global Soil Salinity Estimation at 10 m Using Multi-Source Remote Sensing. J. Remote Sens. 2024, 4, 0130.
38. Jia, P.; He, W.; Hu, Y.; Liang, Y.; Liang, Y.; Xue, L.; Zamanian, K.; Zhao, X. Inversion of Coastal Cultivated Soil Salt Content Based on Multi-Source Spectra and Environmental Variables. Soil Tillage Res. 2024, 241, 106124.
39. Bouasria, A.; Bouslihim, Y.; Gupta, S.; Taghizadeh-Mehrjardi, R.; Hengl, T. Predictive Performance of Machine Learning Model with Varying Sampling Designs, Sample Sizes, and Spatial Extents. Ecol. Inform. 2023, 78, 102294.
40. Khaire, U.M.; Dhanalakshmi, R. Stability of Feature Selection Algorithm: A Review. J. King Saud Univ. – Comput. Inf. Sci. 2022, 34, 1060–1073.
41. Zhang, Z.; He, Y.; Yin, H.; Xiang, R.; Chen, J.; Du, R. Synergistic Estimation of Soil Salinity Based on Sentinel-1/2 Improved Polarization Combination Index and Texture Features. Trans. Chin. Soc. Agric. Mach. 2023, 55, 175–185.
42. Aihaiti, A.; Nurmemet, I.; Yu, X.; Aili, Y.; Li, S.; Lv, X.; Qin, Y. An Enhanced Soil Salinity Estimation Method for Arid Regions Using Multisource Remote Sensing Data and Advanced Feature Selection. CATENA 2025, 256, 109116.
43. Wang, J.; Zhen, J.; Hu, W.; Chen, S.; Lizaga, I.; Zeraatpisheh, M.; Yang, X. Remote Sensing of Soil Degradation: Progress and Perspective. Int. Soil Water Conserv. Res. 2023, 11, 429–454.
44. Shi, S.; Wang, N.; Chen, S.; Hu, B.; Peng, J.; Shi, Z. Digital Mapping of Soil Salinity with Time-Windows Features Optimization and Ensemble Learning Model. Ecol. Inform. 2025, 85, 102982.
45. Wang, J.; Feng, C.; Hu, B.; Chen, S.; Hong, Y.; Arrouays, D.; Peng, J.; Shi, Z. A Novel Framework for Improving Soil Organic Matter Prediction Accuracy in Cropland by Integrating Soil, Vegetation and Human Activity Information. Sci. Total Environ. 2023, 903, 166112.
46. Chi, Y.; Fan, M.; Zhang, Z.; Qu, Y. Zoning the Soil Salinization Levels in the Northern China’s Coastal Areas Based on High-Resolution Soil Mapping. Ecol. Indic. 2025, 172, 113303.
47. Kennedy, J.; Eberhart, R. Particle Swarm Optimization. In Proceedings of the ICNN’95 – International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948.
48. Han, Y.; Shi, P. An Improved Ant Colony Algorithm for Fuzzy Clustering in Image Segmentation. Neurocomputing 2007, 70, 665–671.
49. Omuto, C.T.; Vargas, R.R.; El Mobarak, A.M.; Mohamed, N.; Viatkin, K.; Yigini, Y. Mapping of Salt-Affected Soils – Technical Manual; FAO: Rome, Italy, 2020.
50. Zhang, Y.; Liu, T.; Shamseldin, A.Y.; Tong, X.; Duan, L.; Jia, T.; Lun, S.; Zhang, S. A Synergistic UAV-Landsat Novel Strategy for Enhanced Estimation of above-Ground Biomass and Shrub Dominance in Sandy Land. Ecol. Inform. 2025, 90, 103282.
51. Wang, S.; Li, Y.; Li, T.; Lu, W.; Qi, X.; Xie, X.; Sa, R.; Guo, T.; Pulatov, A.; Javlonbek, I.; et al. Regional Maize Suitability Based on Soil Water and Salt Content Inversion by Integrating Machine and Transfer Learnings in Xinjiang. Soil Tillage Res. 2025, 254, 106740.
52. Romeo Jiménez, V.M.; Notario del Pino, J.S.; Fernández-Guisuraga, J.M.; Mejías Vera, M.Á. Prediction of Some Soil Properties in Volcanic Soils Using Random Forest Modeling: A Case Study at Chinyero Special Nature Reserve (Tenerife, Canary Islands). Ecol. Inform. 2025, 86, 103054.
53. Chi, M.; Plaza, A.; Benediktsson, J.A.; Sun, Z.; Shen, J.; Zhu, Y. Big Data for Remote Sensing: Challenges and Opportunities. Proc. IEEE 2016, 104, 2207–2219.
54. Zhang, X.; Shu, C.; Wu, Y.; Ye, P.; Du, D. Advances of Coupled Water-Heat-Salt Theory and Test Techniques for Soils in Cold and Arid Regions: A Review. Geoderma 2023, 432, 116378.

Figure 1. Flowchart of research techniques.

Figure 2. Study area: site 1 sample points for analysis and site 2 and site 3 sample points for validation.

Figure 3. Distribution of soil salinity samples across sampling sites.

Figure 4. Correlation coefficients between features. (a) Correlation among all features. (b–d) Features selected by PSO, ACO, and Boruta, respectively. (e) Features derived from the VIF method. Red boxes highlight areas with high multicollinearity.

Figure 5. Scatter plots of predicted and true values for the LR model and SHAP analyses corresponding to different processing methods. (a) Scatter plot of predicted and true values for the LR model in the calibration and validation sets using the full dataset. (b,c) are the corresponding SHAP summary plot and SHAP waterfall plot, respectively. (d–f) are the model results derived from the VIF method. (g–i) are the model results based on Boruta. (j–l) correspond to PSO, and (m–o) are the results derived from ACO.

Figure 6. (a) R² of the RF model for different numbers of features. (b) Contribution of the optimal features derived from RFE. (c) Corresponding SHAP values of the optimal features.

Figure 7. Model accuracy corresponding to different modeling methods.

Figure 8. Moran’s I for features (a) and classification accuracy (b).

Figure 9. Maps of soil salinization levels for different algorithms on the GEE platform.

Figure 10. Spatial distribution of regional soil salinity at Kongtaileke Ranch based on RF model and optimal features (a) and prediction uncertainty (b).

Figure 11. Feature SHAP values corresponding to optimal results and model accuracy across the two transfer sites. (a) Calibration set results for different feature sets at Site 2; (b) the corresponding validation set results; and (c) the SHAP values for the optimal features. (d–f) are the calibration and validation results and the SHAP values for the optimal features at Site 3, respectively.

Table 1. Soil sample descriptions for each site.

Site	Location	Date	Number	Sampling Depth
Site 1	Xinjiang, China	14 August 2021–17 August 2021	186	0–20 cm
Site 2	Almaty, Kazakhstan	23 May 2022–18 July 2022	207	0–20 cm
Site 3	Shandong, China	2018–2020	457	0–20 cm

Table 2. RS features extracted from GEE.

Categories	Acronym	Formula
SAR features	VV	VV
	VH	VH
	P1	VV + VH
	P2	VV² + VH
	P3	VV² + VH²
	P4	VH² − VV
	P5	(VH² + VV²)/VH
	P6	10 log(VV)
	P7	10 log(VH)
	P8	10 log(VV) + 10 log(VH)
Multispectral features	NDVI	(B8 − B4)/(B8 + B4)
	GNDVI	(B7 − B3)/(B7 + B3)
	WDVI	B8 − 0.5 × B4
	TNDVI	(0.5 + (B8 − B4)/(B8 + B4))^0.5
	SAVI	((B8 − B4)/(B8 + B4 + 0.5)) × 1.5
	IPVI	B8/(B8 + B4)
	MCARI	B5 − B4 − 0.2 × (B5 − B3) × B5/B4
	REIP	(700 + 40((B4 + B7) × 0.5 − B5))/(B6 − B5)
	MSAVI2	0.5 × (2 × B8 + 1 − ((2 × B8 + 1)² − 8 × (B8 − B4))^0.5)
	DVI	B8 − B4
	NSI	(B11 − B12)/(B11 − B8)
	VSSI	2 × B3 − 5 × (B4 + B8)
	NDSI	(B4 − B8)/(B4 + B8)
	SR	(B3 − B4)/(B2 + B4)
	CRSI	(B8 × B4 − B3 × B2)/(B8 × B4 + B3 × B2)
	BI	(B4² + B8²)^0.5
	SI1	(B3 × B4)^0.5
	SI2	(B4 × B2)^0.5
	SI3	(B3² + B4²)^0.5
	SI4	(B8 × B11 − B11²)/B8
	SI5	B2/B4
	SI6	B4 × B8/B3
Topographic features	DEM	SRTM DEM

Note. B2–B12 denote Sentinel-2 spectral bands. VV and VH represent the dual-polarization bands of Sentinel-1.
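As a minimal sketch of how a few of the Table 2 formulas translate into code, the functions below compute NDVI, SAVI, MSAVI2, CRSI, and BI for a single pixel. The function and argument names are illustrative and not taken from the article. The trailing "2" and "0.5" in the extracted formulas are read here as a square and a square root, respectively. Inputs are assumed to be Sentinel-2 reflectance values on a common scale; because only arithmetic operators are used, the same functions should also work element-wise on NumPy arrays.

```python
# Illustrative helpers for a subset of the Table 2 indices.
# Names are hypothetical; formulas follow Table 2 as printed.

def ndvi(b8, b4):
    """NDVI = (B8 - B4) / (B8 + B4)."""
    return (b8 - b4) / (b8 + b4)


def savi(b8, b4):
    """SAVI = ((B8 - B4) / (B8 + B4 + 0.5)) * 1.5."""
    return (b8 - b4) / (b8 + b4 + 0.5) * 1.5


def msavi2(b8, b4):
    """MSAVI2 = 0.5 * (2*B8 + 1 - sqrt((2*B8 + 1)^2 - 8*(B8 - B4)))."""
    return 0.5 * (2 * b8 + 1 - ((2 * b8 + 1) ** 2 - 8 * (b8 - b4)) ** 0.5)


def crsi(b8, b4, b3, b2):
    """CRSI = (B8*B4 - B3*B2) / (B8*B4 + B3*B2), as given in Table 2."""
    return (b8 * b4 - b3 * b2) / (b8 * b4 + b3 * b2)


def bi(b8, b4):
    """BI = sqrt(B4^2 + B8^2)."""
    return (b4 ** 2 + b8 ** 2) ** 0.5


if __name__ == "__main__":
    # Made-up reflectance values for one pixel (not data from the study).
    b2, b3, b4, b8 = 0.1, 0.2, 0.3, 0.4
    print("NDVI  ", round(ndvi(b8, b4), 4))
    print("SAVI  ", round(savi(b8, b4), 4))
    print("MSAVI2", round(msavi2(b8, b4), 4))
    print("CRSI  ", round(crsi(b8, b4, b3, b2), 4))
    print("BI    ", round(bi(b8, b4), 4))
```

Ratio-type indices are undefined where the denominator is zero, so masked or no-data pixels would need to be filtered out before applying these functions to a full image.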

Table 3. Model parameters; values in parentheses give the range considered for each parameter.

Models	Parameters
LR	fit_intercept = True, copy_X = True, n_jobs = 1
XGB	n_estimators = 200 (100–200), max_depth = 4 (1–4), learning_rate = 0.03 (0.001–0.03), reg_alpha = 1, reg_lambda = 10
PLSR	n_components = 3 (3–9)
KNN	Regression: n_neighbors = 10 (3–10); Classification: Default
RF	Regression: n_estimators = 20, max_depth = 2, min_samples_leaf = 1, min_samples_split = 2; Classification: numberOfTrees = 50, minLeafPopulation = 2
GTB	numberOfTrees = 20
CART	maxNodes = 1000, minLeafPopulation = 2
Naive Bayes	Default
SVM	Default

Table 4. Grading of soil salinity levels and sample numbers.

Intensity	Salinity range	Number of samples
None	<0.75	32
Slight	0.75–2	16
Moderate	2–4	16
Strong	4–8	14
Very Strong	8–15	38
Extreme	>15	70
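The grading in Table 4 can be sketched as a simple threshold lookup. The snippet below is illustrative only: the function names are hypothetical, and the handling of values that fall exactly on a class boundary is an assumption, since Table 4 lists shared endpoints (for example, 2 appears in both "0.75–2" and "2–4"). Here each boundary value is assigned to the higher class, except 15, which stays in "Very Strong" because the last class is given as ">15". Values are assumed to be in the same units as Table 4.

```python
from bisect import bisect_right
from collections import Counter

# Class boundaries and labels copied from Table 4.
INNER_BOUNDS = [0.75, 2, 4, 8]
LABELS = ["None", "Slight", "Moderate", "Strong", "Very Strong"]
EXTREME_ABOVE = 15


def salinity_level(value):
    """Map one salinity value to its Table 4 intensity label.

    Assumption: a value equal to an inner boundary goes to the higher
    class; only values strictly greater than 15 are "Extreme".
    """
    if value < 0:
        raise ValueError("salinity cannot be negative")
    if value > EXTREME_ABOVE:
        return "Extreme"
    return LABELS[bisect_right(INNER_BOUNDS, value)]


def count_levels(values):
    """Count how many samples fall into each intensity class."""
    return Counter(salinity_level(v) for v in values)


if __name__ == "__main__":
    # Made-up values, not the study's measurements.
    samples = [0.3, 1.1, 2.0, 6.5, 15.0, 22.4]
    for s in samples:
        print(s, "->", salinity_level(s))
    print(dict(count_levels(samples)))
```

Applied to the measured values of a site, `count_levels` would produce the kind of per-class sample counts reported in the last column of Table 4.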

Table 5. RF model performance of different modeling approaches across the two transfer sites.

Region	Method	Calibration R²	Calibration RMSE	Calibration LCCC	Validation R²	Validation RMSE	Validation LCCC
Site 2	Full data	0.414	1.303 dS m⁻¹	0.309	0.237	0.800 dS m⁻¹	0.295
	RFE	0.461	1.250 dS m⁻¹	0.327	0.328	0.751 dS m⁻¹	0.328
	VIF	0.388	1.331 dS m⁻¹	0.244	0.148	0.846 dS m⁻¹	0.273
Site 3	Full data	0.742	1.240 g kg⁻¹	0.671	0.650	1.093 g kg⁻¹	0.563
	RFE	0.753	1.212 g kg⁻¹	0.693	0.704	1.005 g kg⁻¹	0.601
	VIF	0.654	1.436 g kg⁻¹	0.526	0.507	1.297 g kg⁻¹	0.407
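For readers who want to reproduce the three accuracy measures reported in Table 5, the following sketch computes R², RMSE, and Lin's concordance correlation coefficient (LCCC) from paired observed and predicted values. The definitions used are the common textbook ones and are an assumption here: R² is taken as 1 − SS_res/SS_tot, and LCCC uses population (divide-by-n) variances. The article's own formulas are not shown in this section, and its results may have been computed with different conventions, such as the squared Pearson correlation for R².

```python
from math import sqrt


def _mean(xs):
    return sum(xs) / len(xs)


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    m = _mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - m) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def lccc(obs, pred):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(obs)
    mo, mp = _mean(obs), _mean(pred)
    var_o = sum((o - mo) ** 2 for o in obs) / n
    var_p = sum((p - mp) ** 2 for p in pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / n
    return 2 * cov / (var_o + var_p + (mo - mp) ** 2)


if __name__ == "__main__":
    # Made-up example: predictions carry a constant bias of +0.5.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.5, 3.5, 4.5]
    print("R2  ", round(r2(observed, predicted), 3))
    print("RMSE", round(rmse(observed, predicted), 3))
    print("LCCC", round(lccc(observed, predicted), 3))
```

In the made-up example the predictions are perfectly correlated with the observations, yet LCCC stays below 1 because it also penalizes the constant offset from the 1:1 line. This is why it is a useful complement to R² when judging transferred models.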

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
