1. Introduction
Soil salinization is a form of soil degradation caused by human activities and natural disturbances. It has severely affected agricultural production and ecosystem balance in many countries [1, 2]. China is among the countries seriously affected: the extensive distribution and long duration of soil salinization have significantly reduced local agricultural productivity and economic benefits [3, 4]. Because soil quality is closely related to soil salinity, it is of great significance to obtain information on soil salinization both rapidly and accurately [5, 6, 7].
Compared with traditional field surveys, remote sensing offers a short revisit period, long-term observation of soil conditions and large amounts of information at a low cost. Remote sensing technologies and methods have therefore attracted great interest for monitoring and evaluating soil salinity dynamics [8, 9], and research on soil salinity prediction using remote sensing images has shown great potential. In recent decades, increasing numbers of scholars have successfully applied optical remote sensing data to obtain information on soil salinity. For instance, El Harti et al. [10] proposed a new soil salinity index using Landsat-OLI data to improve the precision of soil salinity inversion in a Moroccan irrigated area. Scudiero et al. [11] observed a strong correlation between the canopy response in Landsat-ETM+ data and the soil salinity content, and proposed the canopy response salinity index (CRSI) to identify regions affected by soil salinization. Bannari et al. [12] found that the SWIR bands of the Sentinel-2 satellite are more sensitive to soil salinity and are excellent candidates for integration into soil salinity modeling and monitoring. In addition, GIS technology combined with optical remote sensing using Landsat TM data has become a promising way to analyze the distribution of dynamic changes in soil salinization [13]. However, these approaches mostly use optical remote sensing data and rarely use polarimetric SAR, which contains more structural features [14]. Few studies have extracted land salinization information using radar polarization decomposition technology, and this line of work is still at a preliminary stage [15]. Microwave remote sensing is a common means of obtaining surface information. Its good penetrability and all-weather detection ability make up for the shortcomings of optical remote sensing, and it has certain advantages in soil composition monitoring [16]. Taghadosi [17] used support vector regression (SVR) to analyze texture information obtained from Sentinel-1 SAR data, studied a method of directly correlating radar intensity with soil salinity, and achieved good soil salinity inversion results. Liu et al. [18] established an inversion model of surface soil salinity in the Hetao tank farm of Inner Mongolia using the backscattering coefficients of four-polarization radar based on Radarsat-2 data. Lasne et al. [19] showed that, for microwave frequencies of 1–7 GHz, soil salinity is more strongly related to the imaginary part of the dielectric constant, while the real part is more closely connected with soil water content. Therefore, the effective use of polarimetric SAR data can provide a new method for obtaining soil salinity information over a wide area, and can offer timely technical support for agricultural production practices.
However, soil salinity monitoring that relies only on remote sensing data usually has unsatisfactory accuracy, because the causes of soil salinization are complex and include both human activities and natural factors [20]. Soil salinity is also related to various environmental factors. Therefore, many scholars have begun to select environmental variables from the perspective of soil genesis to retrieve information on soil salt content [21]. The parent material, biological index, soil index and topographic parameters should be considered as factors for retrieving soil salt content [22, 23]. Shahrayini et al. [24] observed that topographic factors, such as vertical distance to channel network (VDCN), analytical hill-shading (AH), flow accumulation (FA) and topographic wetness index (TWI), have a strong influence on the prediction of soil salinization. The results reported in [25] indicate that topographic factors make the most decisive contribution to soil salinity. Taghizadeh-Mehrjardi et al. [26] found, using a regression tree model, that the TWI was the most important parameter in the 60–100 cm depth interval, and their results showed that terrain parameters become more important with increasing soil depth.
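As a point of reference for the topographic wetness index discussed above, the sketch below assumes the commonly used definition TWI = ln(a / tan β), where a is the specific catchment area and β is the local slope. The function name and the minimum-slope guard are illustrative assumptions and are not taken from the cited studies.

```python
import math


def topographic_wetness_index(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """Return TWI = ln(a / tan(beta)) for one DEM cell.

    Assumed standard definition; not a formula given in this paper.
    specific_catchment_area: upslope area per unit contour width (m), must be > 0.
    slope_deg: local slope in degrees.
    min_slope_deg: hypothetical guard so that flat cells do not divide by zero.
    """
    if specific_catchment_area <= 0:
        raise ValueError("specific catchment area must be positive")
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(beta))


# Flat, high-accumulation cells get larger TWI values than steep cells.
flat_cell = topographic_wetness_index(100.0, 1.0)
steep_cell = topographic_wetness_index(100.0, 10.0)
```

Under this definition, low-lying cells that collect flow have high TWI, which is in line with the reported link between terrain and salt accumulation.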
Therefore, to improve the accuracy of soil salinization mapping, this research considered environmental variables selected from the perspective of soil genesis. On this basis, four methods (stepwise multiple regression, support vector machine, random forest and partial least squares) were used to study soil salinity in combination with radar data. Each of these methods has precedent in soil salinity studies. Sidike et al. used the partial least squares regression method for soil salinity estimation in the Pingluo Region of China, and obtained a better prediction accuracy than with stepwise regression [27]. The partial least squares regression method has also been used to establish an excellent relationship between soil EC measurements and reflection spectra [28]. Nurmemet et al. [15] used the support vector machine to monitor soil salinization in northwest China with fused data, including Landsat ETM+, Radarsat-2 and PALSAR, and their results show that the support vector machine is an excellent method for monitoring soil salinization. Wang et al. [25] integrated remote sensing data and landscape characteristics using four methods (PLSR, SVM, CNN and RF) to monitor salt-affected soil in southern Xinjiang, concluding that the random forest model had the best regression performance for mapping soil salinity in this region.
Furthermore, because there are many parameters and some of the selected machine learning algorithms lack a built-in feature-filtering function, information redundancy occurs during modeling. It is therefore particularly important to choose an appropriate feature-filtering algorithm to further improve model performance. Currently, several variable screening methods, including the Pearson correlation coefficient, grey correlation analysis, ridge regression and the optimal subset method, are widely used to improve data-mining performance. Less commonly used feature-filtering methods include the genetic algorithm (GA) and random forest (RF). Allbed et al. [29] used Pearson correlation analysis to screen features. Zhao et al. [30] analyzed combinations of UAV multi-spectral image features using the Optimum Index Factor (OIF) method to obtain the best band combination for ground feature classification in their study area, which significantly improved the classification accuracy. Chen et al. [31] used grey correlation analysis to screen spectral indices derived from a multi-spectral camera, and concluded that the model optimized by grey correlation performed better than the one built on the original vegetation index group. Xu et al. [32] proposed a new support vector regression model based on an adaptive genetic algorithm (AGA-SVR). Their results show that, compared with other models, support vector regression based on the genetic algorithm (GA-SVR) and AGA-SVR obtained more accurate soil salinization information with fewer characteristic parameters. The feature importance score of a random forest model, obtained through experimentation, provides information that does not rely on specific model assumptions and requires only a threshold for making data-driven decisions [33].
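Screening by random forest feature importance can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn's RandomForestRegressor and its impurity-based feature_importances_ attribute. The feature names, the top_k cut-off and the data are hypothetical and do not reproduce the variables or the selection rule used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def screen_features(X, y, names, top_k=2, seed=0):
    """Rank features by RF impurity-based importance and keep the top_k.

    top_k is a hypothetical cut-off; the paper does not state its rule.
    Returns (selected, ranked), both lists of (name, importance) pairs.
    """
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X, y)
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]  # most important first
    ranked = [(names[i], float(importances[i])) for i in order]
    return ranked[:top_k], ranked


# Synthetic example: only f0 and f1 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
names = ["f0", "f1", "f2", "f3", "f4"]

selected, ranked = screen_features(X, y, names)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The importances are normalized to sum to one, so a fixed threshold or a top-k rule can be applied directly to the ranked list. Either choice is a modeling decision and should be reported with the results.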
Generally, increasing numbers of scholars are using optical or radar data to retrieve soil salinity. However, few studies choose environmental variables from the perspective of soil genesis to monitor soil information in combination with optical or radar data. This research attempted to screen the optimal combination of variables based on the random forest method of feature importance, using environmental factors derived from DEM data and optical or radar indices derived from Sentinel-1/2 remote sensing imagery data to map the spatial distribution of soil salinization in Shandong Province, in combination with four machine learning algorithms. In this study, we compared and analyzed the most suitable model for this region among the four models and selected the optimal model to monitor and obtain soil salinization information. The proportion of salt-affected soil area was considered to provide scientific guidance for the environmental management and ecological protection of salinized soil in the area.
4. Discussion
Soil salinization is a global environmental threat to ecosystem balance and agricultural production. Therefore, it is critical to monitor soil salinization in vulnerable areas. In this study, we used remote sensing data and topographic factors to obtain soil salinity information. Radar signals can penetrate the Earth’s surface and therefore play a particular role in monitoring soil salinity. Topographic factors were selected because terrain is one of the factors affecting soil formation. The experiment obtained soil salinization information from characteristic variables derived from Sentinel-1/2 and DEM data, using four regression methods for analysis. By combining the advantages of topographic factors and remote sensing data, it was possible to screen the most suitable and influential characteristic parameters for this area and to select the optimal regression model for creating a spatial distribution map of soil salinization. As part of this study, we used the feature importance of the RF model to screen feature parameters. This method shows both the influence of a single feature and the relative importance among variables, and the results show that the RF method is a good option for feature screening. The results obtained in this research are summarized and discussed in detail below.
In general, soil salinization is affected by many factors, including terrain, climate, biology and parent material, so soil salinity mapping is subject to many sources of interference [8]. The experimental results support this view. For example, among all the decomposed radar parameters, only VV and entropy were selected as feature variables, and both rank low in the feature importance ranking of the RF model: among the ten selected features, VV (0.165) ranks sixth and entropy (0.069) ranks ninth. The correlation coefficients of the PLSR model also suggest that the effect of the radar parameters on the model might not be ideal. Most existing studies consider only remote sensing image data or only DEM data to retrieve the soil salt content [27, 57]. For example, Zhang [9] achieved good results in retrieving soil salinity using 10 radar remote sensing features extracted from Sentinel-1 SAR image data. Vermeulen [58] observed that there was great potential to monitor the accumulation of salt in irrigation areas using a DEM and its derivatives, combined with four machine learning algorithms. However, in this paper, the accuracy of the four models was not ideal when only remote sensing image features or only terrain factors were used to predict soil salt content. Therefore, this paper used both remote sensing image characteristics and topographic factors to retrieve the soil salt content in the study area. The accuracy of the four models improved significantly when all these factors were considered, and the best model was PLSR (R² = 0.66, RMSE = 1.30). It is therefore suggested that various factors be considered comprehensively when inverting soil salinity in order to obtain a model with a higher performance.
The importance analysis of the screened characteristic parameters in Figure 4 shows that the CRSI is the parameter with the most significant influence, followed by the salinity index SI6. The topographic factors and radar-derived products contribute little to the soil salinity model in this area. This result indicates that the parameters derived from the optical data are the most suitable factors for soil salinity modeling in the local area, and it is consistent with the observation above that soil salinity is difficult to monitor using radar data alone.
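For readers unfamiliar with the CRSI, the sketch below assumes the form commonly attributed to Scudiero et al., CRSI = sqrt(((NIR × R) − (G × B)) / ((NIR × R) + (G × B))), computed from surface reflectance. The function name, the masking of invalid pixels and the mapping to Sentinel-2 bands (B8, B4, B3, B2) are assumptions made for illustration, not details confirmed by this paper.

```python
import numpy as np


def crsi(nir, red, green, blue):
    """Canopy response salinity index from reflectance arrays.

    Assumed form: sqrt(((NIR*R) - (G*B)) / ((NIR*R) + (G*B))).
    Pixels with a non-positive denominator or a negative ratio become NaN.
    """
    nir, red, green, blue = (
        np.asarray(band, dtype=float) for band in (nir, red, green, blue)
    )
    num = nir * red - green * blue
    den = nir * red + green * blue
    valid = (den > 0) & (num >= 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(valid, num / den, np.nan)
    return np.sqrt(ratio)


# Two hypothetical pixels: a vegetated pixel and an empty (all-zero) pixel.
values = crsi([0.4, 0.0], [0.1, 0.0], [0.1, 0.0], [0.05, 0.0])
```

Invalid pixels are masked rather than clipped so that no-data areas do not enter later regression steps as spurious zeros.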
In this study, four regression models (SMR, SVR, RFR and PLSR), all of which have been widely used to obtain information on soil salinity, were used to monitor soil salinity. On the testing set, the PLSR model has the highest accuracy (R² = 0.66, RMSE = 1.30) and is the most suitable model for soil salt inversion in this region. The other three models have a lower accuracy: the RFR model (R² = 0.63, RMSE = 1.33), the SMR model (R² = 0.51, RMSE = 1.38) and the SVR model, which has the lowest accuracy (R² = 0.40, RMSE = 0.29). The R² of the RFR model reaches 0.76 on the training set but falls to 0.63 on the testing set, which is below the testing-set R² of the PLSR model (0.66). This gap indicates that the models in this experiment have a certain degree of overfitting, which we attribute to the relatively small data sample.
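The accuracy measures quoted above can be computed as in the following sketch. It assumes the usual definitions, R² = 1 − SS_res / SS_tot and RMSE as the root of the mean squared residual. The paper does not state which R² definition was used, so this is an assumption, and the function names are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Train/test gap for the RFR model, using the R² values reported in the text.
rfr_train_r2, rfr_test_r2 = 0.76, 0.63
rfr_gap = rfr_train_r2 - rfr_test_r2  # a positive gap suggests overfitting
```

Reporting the training and testing scores side by side, as done here for the RFR model, makes the degree of overfitting explicit instead of leaving it implied by the testing score alone.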
The salinization distribution map of the study area shows that non-saline soil is mainly distributed in the eastern part of the region and accounts for 64.2% of the area, the largest share. Slightly salinized soil is distributed in the west and north and accounts for 29.2% of the area. Moderately salinized soil is scattered across the study area and accounts for 6.2% of the total area. Only a small amount of highly salinized soil is distributed in the southeast and north, accounting for only 0.4% of the area, and there is no severely salinized soil in the study area. The proportion of each salinization degree is consistent with the sample data and the reference data. The areas with highly salinized soil are mainly distributed around residential areas and water bodies, and are caused by low topography and shallow groundwater. Because of the high topography, the soil salt content in the northeast region of the map is low.
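Area proportions of this kind are typically derived by binning the predicted salt-content raster into classes and counting pixels. The sketch below shows one way to do this with NumPy. The class boundaries (1, 2 and 4 g/kg), the class labels and the small example grid are hypothetical placeholders, not the thresholds or data used in this study.

```python
import numpy as np


def class_area_percentages(salt_map, upper_bounds, labels):
    """Percentage of valid pixels falling in each salinity class.

    upper_bounds: ascending class boundaries (hypothetical values, g/kg).
    labels: one label per class, so len(labels) == len(upper_bounds) + 1.
    NaN pixels (no data, e.g. water or built-up masks) are ignored.
    """
    values = np.asarray(salt_map, dtype=float)
    values = values[~np.isnan(values)]
    # digitize returns 0 for values below the first bound, 1 for the next bin, ...
    class_index = np.digitize(values, upper_bounds)
    counts = np.bincount(class_index, minlength=len(labels))
    return {label: 100.0 * count / values.size for label, count in zip(labels, counts)}


# Hypothetical predicted salt content (g/kg) on a tiny 2 x 5 grid.
salt_map = [[0.5, 0.8, 1.5, np.nan, 0.3],
            [2.5, 0.2, 0.9, 5.0, np.nan]]
labels = ["non-saline", "slight", "moderate", "high"]
shares = class_area_percentages(salt_map, [1.0, 2.0, 4.0], labels)
```

Because the percentages are computed over valid pixels only, they sum to 100%, as do the shares reported in the text (64.2%, 29.2%, 6.2% and 0.4%).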
Finally, the results indicate that more accurate models and spatial maps of soil salinity can be generated by combining topographic factors selected from the perspective of soil genesis with optical or radar data. The results also show that machine learning methods are an effective tool for obtaining information on soil composition.
5. Conclusions
In this study, four methods (SMR, SVR, RFR and PLSR) were used to predict the spatial distribution of surface soil salt in this region from topographic factors, vegetation indices, salinity indices and polarization decomposition parameters. To reduce feature redundancy in the modeling process, this study used the feature importance of the RF model to screen all features and select the more effective feature variables. The results show that the CRSI contributed the most, which is consistent with other findings and indicates that it is feasible to use the feature importance of the RF model to screen features. The PLSR model performs better than the other three models (SMR, SVR and RFR), and it can describe the local variation in soil salinity in more detail. The prediction accuracy of the PLSR method was the highest on the testing set, with an R² of 0.66 and an RMSE of 1.30 g/kg, indicating that the PLSR model is feasible for predicting soil salinity. According to the soil salt distribution map, the soil salt inversion results are consistent with the existing data: the level of soil salt is higher near water bodies and tidal flats, and lower in woodland and farmland. High-precision inversion of soil salinity is still restricted by many factors, such as the number of samples, the climate data during the sampling period, the accuracy of the land use type, and the vegetation type and density. These factors should be fully considered in future work to improve the inversion accuracy. In future research, we will also consider combining the advantages of environmental factors and remote sensing data to find the most influential environmental factors and remote sensing parameters in the study area.
For example, quad-polarimetric SAR data can be selected to monitor soil salinity in future research, which would significantly increase the number of feature variables and add more textural and spatial information. If conditions permit, the decomposition parameters of SAR data in different frequency bands (such as L, C and X) can also be added to the soil salinity model, which would significantly increase the potential of soil salt inversion. This study provides a basis for the further promotion of salinization monitoring and the selection of more effective characteristic variables, and provides a reference for land utilization and agricultural production in future studies.