Digital Mapping of Soil Organic Carbon Using Sentinel Series Data: A Case Study of the Ebinur Lake Watershed in Xinjiang

Abstract: As an important index of soil quality, soil organic carbon (SOC) plays an important role in soil health, ecological security, soil material cycling and the global climate cycle. Mapping SOC distribution with multi-source remote sensing supports the study of SOC storage and the regional ecological cycle. However, studies of SOC distribution in the Ebinur Lake Basin, which lies in an arid and semi-arid region, have been limited to mapping from measured data, and SOC mapping with remote sensing data remains to be studied. Whether different machine learning methods can improve prediction accuracy in the mapping process has also been little studied in arid areas. To address these problems, this study selected a typical area of the Ebinur Lake Basin as the study area and took Sentinel data as the main data source: Sentinel-1A (radar data) and Sentinel-2A and Sentinel-3A (multispectral data), combined with 16 DEM derivatives and climate data (annual average temperature, MAT; annual average precipitation, MAP). The five types of data were spatially resampled to four spatial resolutions (10, 100, 300, and 500 m). Seven models were constructed and used for prediction with the machine learning methods random forest (RF) and Cubist. The results show that the prediction accuracy of the RF model is better than that of the Cubist model, indicating that the RF model is more suitable for small areas in arid regions. Among the three data sources, Sentinel-1A has the highest SOC prediction accuracy (R² = 0.391) at 10 m resolution under the RF model. The variable importance results show that Flow Accumulation ranks high in the RF model, whereas the DEM derivative SLOP ranks high in the Cubist model. In the prediction results, SOC is mainly distributed in oases and regions with more human activity, and is lower elsewhere. This study provides a reference for predicting the spatial distribution of SOC at small scales by means of remote sensing and environmental factors.


Introduction
Soil organic carbon (SOC), as the main carbon pool on the land surface, plays an important role in the long-term carbon cycle [1]. Natural and anthropogenic factors within a small area can directly change SOC and indirectly affect the carbon cycle within that area [2]. Small changes in SOC can alter atmospheric CO2 and thus affect global climate [3]. Global climate change is a serious problem that threatens the security of the Earth system and human health [4]. To anticipate climate change and to find more effective ways of controlling its direction, predicting land surface SOC is an essential step. Digital soil mapping and prediction of SOC provide a basis for understanding the terrestrial carbon cycle and future climate change. More comprehensive SOC mapping improves our understanding of soil carbon storage and can also contribute to the global carbon cycle and the carbon economy [5].
Traditional sampling and measurement of soil organic carbon consume a great deal of manpower and material resources, and a series of traditional calculation methods [6,7] is then needed to obtain an accurate distribution of soil organic carbon over a wide area [8]. With the wide application of digital soil mapping (DSM), a fast and accurate mapping method based on data mining has become available [9]. This method has been widely applied to other soil properties [10]. Applying it to the monitoring of soil organic carbon greatly improves the efficiency of mapping and monitoring and reduces the cost of sampling and analysis [11]. Traditional soil mapping considers only the soil properties themselves and does not take into account the influence of climate, topography and other environmental factors on those properties [12,13]. The selection of environmental covariates directly affects the predicted distribution of soil organic carbon [14,15]. Within a given area, changes in soil organic carbon are affected by topography, climate, land use, vegetation cover and soil parent material [16]. Temperature and precipitation are highly important climatic factors in the prediction of soil carbon content [17,18]. Terrain attributes are also used as main predictors in soil organic carbon mapping, and both the prediction results and the importance of variables change as the terrain attributes change. Selecting covariates is therefore an essential part of soil organic carbon mapping. Xinjiang is located in an arid area where precipitation is unevenly distributed [19] and where climate extremes and terrain relief are pronounced [20,21]. Climate and terrain factors have a great impact on the distribution of soil organic carbon and play a leading role in the spatial distribution of SOC.
With the development of remote sensing and spatial information technology, soil mapping has advanced rapidly from the initial prediction of distribution through simple regression models [9,22,23], to the analysis of spatial differences [24,25], to the current digital spatial mapping through machine learning [26,27]. Machine learning is now a common approach in digital mapping; commonly used methods include RF, Cubist, SVM, and BRT. Each method has its own shortcomings, and the appropriate model is selected according to the size of the region [11,28,29].
With the continuous progress of optical sensors, from the widely used Landsat series to the Sentinel series [30,31], spectral resolution and the amount of available information have kept increasing, and the ability to monitor the properties and optical characteristics of surface soil has continuously improved [32]. To ensure a long-term global Earth observation system, the European Space Agency (ESA) established the Sentinel satellite program. The Sentinel-1A Earth observation satellite (a SAR satellite) was launched in 2014, and the Sentinel-2A high-resolution multispectral imaging satellite and the Sentinel-3A environmental monitoring satellite were launched successively from 2015 to 2016; all provide free data [33]. Sentinel-1A data are effective for detecting soil organic carbon, mainly because of their penetration characteristics; Sentinel-2A and Sentinel-3A, with complementary revisit cycles [34,35] and a combination of spatial resolutions from high to low, offer many advantages in mapping (fast, time saving, large coverage area, etc.) [36]. Combining the three not only avoids errors caused by differences in acquisition time but also combines SAR data with spectral data for high-precision, multi-temporal SOC prediction, which improves prediction accuracy and ability. Many remote sensing satellites have now been launched, and large volumes of data are acquired by different sensors. Combining data from so many sensors requires multi-scale data for soil mapping [37,38]. Selecting the optimal data scale for each research area can raise mapping accuracy to a better level [39,40].
Most soil mapping studies investigate the distribution of soil properties using a single kind of satellite data [41-43], which does not exploit the complementary strengths of different kinds of satellite data for extracting surface parameters. Some studies have obtained results on the distribution and content of surface SOC using Sentinel-1A, Sentinel-2A, and Sentinel-3A data separately [30,44], but they are limited to one or two kinds of satellite data [45-47], and few studies combine all three. Meanwhile, differences among satellite data at different scales also affect the accuracy of SOC prediction [48]. Therefore, the spatial variation of SOC under different satellite data and different scales needs to be considered comprehensively.
The main purpose of this study is to use two machine learning methods (RF and Cubist) to comprehensively analyze the ability of three Sentinel sensors (S-1, S-2, S-3), together with two groups of environmental variables (DEM derivatives and climate data), to predict and digitally map SOC in a typical arid area, with particular attention to the three different types of remote sensing data. Four spatial resolutions (10, 100, 300, and 500 m) are used to build the SOC prediction models so that the prediction performance of data at different spatial resolutions can be compared. The importance of the selected environmental variables is evaluated, and the potential for SOC prediction under different environmental variables and spatial resolutions is explored.

Study Area
The Ebinur Lake Basin is located in the hinterland of Eurasia, in Bortala Mongolian Autonomous Prefecture, Xinjiang, China (44°2′ N-45°23′ N, 79°53′ E-83°53′ E) (Figure 1). The elevation difference within the basin is large (4713 m), and Ebinur Lake occupies the lowest terrain. Elevation increases toward the north and south, forming a mountainous environment. To the south is the western edge of the Tianshan Mountains, and to the east is Alashankou, the largest wind outlet in Asia, which creates an extreme funnel-shaped terrain environment [49]. Under the temperate continental arid climate and regional topography, the annual average temperature (MAT) is 8.0 °C, the annual average precipitation (MAP) is 89.9-169.7 mm, and the annual average sunshine duration is 2696 h [50]. Because evaporation in arid areas is strong, reaching 1500-2000 mm, the soil shows obvious changes between dry and wet seasons [51,52]. The main soil types are black calcium soil, chestnut soil, brown desert soil, gray desert soil and gray calcium soil. With changing climate and hydrothermal conditions, SOC is most strongly correlated with saline soils and increases with increasing soil salinization. In contrast, SOC is less correlated with gray-brown desert soils and aeolian sandy soils in deserts. Because the study area is located in an arid zone, changes in SOC are clearly related to the management of soil salinization, which gives the SOC distribution in the area a distinct regional pattern.

Soil Data Source
The soil data come from the team's field sampling and cover different land use types and soil textures in the surface layer (0-10 cm). There were 105 samples in total. The samples were brought back to the laboratory for pretreatment and determination of soil organic carbon, which involved three steps: (1) Soil preparation: the field samples were air-dried naturally in a cool environment at a room temperature of 25-30 °C, and the dried soil was then carefully ground and passed through a 100-mesh sieve (0.149 mm) to obtain clean samples; (2) Sample pretreatment: to eliminate the possible salt effect of some samples in this study area, the soil samples were leached with hydrochloric acid before the determination of SOC [53]; (3) Determination of soil organic carbon: the measurement was carried out by the potassium dichromate oxidation-external heating method [54].

Topographic Variables
The topographic variables were derived from DEM data with a spatial resolution of 90 m obtained from the Shuttle Radar Topography Mission (SRTM). The data were processed and clipped to cover the whole study area. Using SAGA GIS [55], 16 terrain variables were derived from the DEM. Table 1 gives a brief description of these variables and the reference methods. The terrain variables fall into three types: the first is the basic data type, the second is based on regional terrain parameters, and the third is a combined type that combines the first two types of terrain parameters [56].
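Two of the simplest DEM derivatives, slope and aspect, can be illustrated with a small finite-difference sketch. This is an assumption-laden illustration, not the SAGA GIS procedure used in the study: it uses plain central differences on interior cells, the function name `slope_aspect` is hypothetical, and the DEM is a nested list whose first row is taken to be the northern edge.

```python
import math

def slope_aspect(dem, cell_size):
    """Return (slope_deg, aspect_deg) grids for the interior cells of a DEM.

    dem is a list of rows (row 0 is assumed to be the northern edge) and
    cell_size uses the same unit as the elevations. Border cells stay None.
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    aspect = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences: x grows eastward, y grows northward.
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r - 1][c] - dem[r + 1][c]) / (2 * cell_size)
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
            if dz_dx == 0 and dz_dy == 0:
                aspect[r][c] = None  # flat cell: aspect is undefined
            else:
                # Downslope direction as a compass bearing (clockwise from north).
                aspect[r][c] = math.degrees(math.atan2(-dz_dx, -dz_dy)) % 360
    return slope, aspect

# Hypothetical 4 x 4 DEM on a 90 m grid that rises 90 m per cell toward the east.
dem = [[90 * c for c in range(4)] for _ in range(4)]
slope, aspect = slope_aspect(dem, 90)
print(slope[1][1], aspect[1][1])
```

On this synthetic plane the interior cells have a 45° slope and face west, which is a quick sanity check before applying the same idea to real elevation data.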

Remote Sensing Variables and Processing
The Sentinel-1A, Sentinel-2A and Sentinel-3A data were obtained from ESA (https://scihub.copernicus.eu/ (accessed on 10 January 2021)) [33]. Sentinel-1A carries a 5.4 GHz (C-band) SAR sensor and operates as part of the A/B dual-satellite system, which can observe the Earth continuously in multiple time phases. Data are collected in four polarization modes (VV, HH, VH, and HV). From the scenes covering the study area, we selected data acquired in the interferometric wide swath (IW) mode, with a spatial resolution of 2.3 m × 14 m and a swath width of 250 km, in the VV and VH polarization modes (Table 2), and extracted the backscattering coefficient [65,66]. The Radar module in SNAP software was used for data processing, following the chain GRD data → speckle filtering → radiometric calibration → geocoding → data output, to obtain usable data [67]. Sentinel-2A data are provided at three spatial resolutions (10 m, 20 m, and 60 m) and contain 12 spectral bands (Table 3). The visible bands are B2, B3, and B4; the vegetation red-edge bands are B5, B6, B7, and B8a; and B11 and B12 are short-wave infrared bands. Sentinel-2A products underwent radiometric calibration and atmospheric correction using the Sen2Cor model in SNAP software [68].
Sentinel-3A land data were extracted from the 21 spectral bands (400-1020 nm) of the Ocean and Land Colour Instrument (OLCI) sensor. The swath width is 1270 km and the spatial resolution is 300 m × 300 m. Sentinel-3A data were preprocessed in ENVI 5.5 for geometric positioning, radiometric calibration and atmospheric correction [47]. The optical instruments, including OLCI, can achieve an ocean revisit cycle of less than 3.8 days and a land revisit cycle of less than 1.4 days.
MAT and MAP for the study area were derived from nearly 50 years (1961-2010) of meteorological data provided by the China Meteorological Data Sharing Network (https://data.cma.cn/ (accessed on 10 January 2021)) [69]. The climate data come from meteorological observation stations established in China and monitoring stations in surrounding areas, which supports the accuracy of the meteorological data [69].
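The predictors described here have different native resolutions, while the models are compared at 10, 100, 300, and 500 m. One simple way to bring a fine raster onto a coarser grid is block averaging; the sketch below assumes this approach for illustration only, since the resampling method actually used in the study is not stated. The function name `block_mean` and the toy raster are hypothetical.

```python
def block_mean(grid, factor):
    """Aggregate a 2-D grid by averaging non-overlapping factor x factor blocks.

    Rows or columns that do not fill a complete block are dropped.
    """
    rows = len(grid) // factor
    cols = len(grid[0]) // factor
    out = []
    for br in range(rows):
        out_row = []
        for bc in range(cols):
            # Collect every fine cell that falls inside this coarse cell.
            block = [
                grid[br * factor + i][bc * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out

# Hypothetical 4 x 4 raster at 10 m, aggregated by a factor of 2 (to 20 m).
fine = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [2, 2, 6, 6],
]
coarse = block_mean(fine, 2)
print(coarse)
```

Under this scheme, going from 10 m to 100 m would use a factor of 10. Moving a coarse layer onto a finer grid is a different operation (interpolation or nearest-neighbour assignment) and is not covered by this sketch.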

Modelling Techniques
Machine learning models are relatively mature. In this paper, two widely used machine learning models were selected to predict and model soil organic carbon and to obtain effective results.

Random Forest
Random forest (RF) is an ensemble method based on decision trees that supports both classification and regression [70]. It avoids the overfitting and low precision of a single decision tree [71]. Sub-samples are drawn from the training set to construct the trees, so that different trees are trained on different data subsets. During training, each tree is built separately and validated internally, which improves accuracy and captures the characteristics of all the samples [72]. This method not only optimizes the selection of predictor variables but also gives predictions with higher accuracy and better balance than a single decision tree algorithm [73]. The random forest algorithm reduces the influence of noise and is resistant to overtraining. It also performs well in modeling with environmental variables and has been maturely applied to soil mapping and soil organic carbon mapping [74].
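The two ingredients described above, bootstrap sub-samples and trees trained on different subsets, can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used in the study: each "tree" is a depth-1 regression stump, the random feature subset is drawn once per tree (equivalent to per split here because each tree has a single split), and all function names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree using only the given feature indices."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # every candidate feature is constant in this sample
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    f, thr, left_mean, right_mean = stump
    return left_mean if row[f] <= thr else right_mean

def fit_forest(X, y, n_trees=50, n_features=1, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), n_features)      # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Hypothetical toy data: the target depends on feature 0 only; feature 1 is noise.
X = [[i % 10, (i * 7) % 5] for i in range(20)]
y = [2 if row[0] < 5 else 10 for row in X]
forest = fit_forest(X, y, n_trees=50, n_features=1, seed=0)
low = predict_forest(forest, [1, 0])
high = predict_forest(forest, [9, 0])
print(low, high)
```

Averaging over many trees smooths out the stumps that happened to split on the noise feature, which is the intuition behind the method's robustness to noise. A production analysis would rely on a full implementation with deep trees rather than this sketch.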

Cubist
Cubist is a rule-based nonparametric regression tree method whose advantage lies in its capacity for in-depth data mining. In constructing a regression tree, the data are first partitioned and then processed, so that different leaf nodes are trained separately and a multivariate model is formed [75]. During fitting, successive models calibrate the established models to reduce data redundancy and the variation within the data set of each leaf node [76-78]. Within the data set of each leaf node, the relationship between the data is established by stepwise multiple regression, and the sub-models are then combined to obtain high-accuracy results [79]. The parameters of the Cubist model control the size of the decision tree during learning; this reduces and simplifies the minimum number of observations covered by the rules and so narrows their scope [80]. In addition, combining the Cubist model with other machine learning models can greatly improve model accuracy. The Cubist model has performed well in modeling and mapping multiple types of soil properties in previous studies [81-83], and it can relate soil environmental variables to soil organic carbon data through regression.
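The idea of partitioning the data first and then fitting a regression model in each leaf can be sketched with a one-split model tree. This is only a toy illustration under strong assumptions: a single predictor, a single split, and ordinary least squares in each leaf. It is not the Cubist algorithm itself, which additionally builds rule sets and applies further corrections; all names below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def sse(xs, ys, line):
    """Sum of squared residuals of a fitted line."""
    a, b = line
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def fit_model_tree(xs, ys, min_leaf=3):
    """Choose one split on x and fit a separate linear model in each leaf."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        lx, ly = zip(*pairs[:i])
        rx, ry = zip(*pairs[i:])
        left, right = fit_line(lx, ly), fit_line(rx, ry)
        total = sse(lx, ly, left) + sse(rx, ry, right)
        if best is None or total < best[0]:
            thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, thr, left, right)
    if best is None:
        raise ValueError("not enough distinct x values to split")
    return best[1:]

def predict_model_tree(tree, x):
    """Apply the rule 'if x <= thr use the left model, else the right model'."""
    thr, left, right = tree
    a, b = left if x <= thr else right
    return a + b * x

# Hypothetical piecewise-linear data: rising for x < 5, falling afterwards.
xs = list(range(10))
ys = [2 * x if x < 5 else 30 - 3 * x for x in xs]
tree = fit_model_tree(xs, ys)
print(tree[0], predict_model_tree(tree, 2), predict_model_tree(tree, 7))
```

Each leaf reads as a rule with its own regression equation, which is why this family of models remains interpretable while still capturing non-linear responses.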

Model Calibration and Validation
In this paper, SOC mapping models were constructed with the machine learning methods RF and Cubist. The predictors comprise SAR data, optical data, DEM derivatives and climate data, and the prediction ability and variable importance for SOC are discussed under different data combinations. Models A, B, and C use Sentinel-1A, Sentinel-2A, and Sentinel-3A data, respectively; Model D combines DEM derivatives and climate data; Model E combines SAR data (Sentinel-1A) with DEM derivatives and climate data; Model F combines optical data (Sentinel-2A, Sentinel-3A) with DEM derivatives and climate data; and Model G combines all data. For model evaluation, 75% of the soil organic carbon data were selected for training and the remaining 25% were used for validation. Ten-fold cross-validation was applied to the 75% training data: ten subsets were used for internal modeling and cross-validation to improve accuracy [84]. Four indices were selected to assess prediction accuracy: the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and Lin's concordance correlation coefficient (LCCC) [85]. In the formulas for these indices, n is the number of samples, Ō is the mean of the observed values, Oi is the observed value at point i, P̄ is the mean of the predicted values, Pi is the predicted value at point i, and r is the Pearson correlation coefficient between the measured and modeled values; the LCCC also uses the standard deviations of the measured and simulated values. The closer the LCCC is to 1, the better the fit and the closer the relationship between simulated and measured values lies to the 1:1 line [86].
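The four accuracy indices can be computed as in the minimal sketch below. It assumes their standard definitions, which are not confirmed by the text: R² as one minus the ratio of the residual to the total sum of squares, and LCCC with population (divide-by-n) variances. The function name `soc_metrics` and the sample values are hypothetical.

```python
import math

def soc_metrics(observed, predicted):
    """Return R2, RMSE, MAE and LCCC for paired observed/predicted values.

    Assumed definitions:
      R2   = 1 - sum((Oi - Pi)^2) / sum((Oi - O_mean)^2)
      RMSE = sqrt(sum((Oi - Pi)^2) / n)
      MAE  = sum(|Oi - Pi|) / n
      LCCC = 2 * r * sd_O * sd_P / (sd_O^2 + sd_P^2 + (O_mean - P_mean)^2)
    """
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - o_mean) ** 2 for o in observed)
    var_o = sst / n
    var_p = sum((p - p_mean) ** 2 for p in predicted) / n
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted)) / n
    r = cov / math.sqrt(var_o * var_p)  # Pearson correlation coefficient
    return {
        "R2": 1 - sse / sst,
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
        "LCCC": 2 * r * math.sqrt(var_o * var_p)
        / (var_o + var_p + (o_mean - p_mean) ** 2),
    }

# Hypothetical example: predictions are perfectly correlated but biased by +1.
metrics = soc_metrics([1, 2, 3, 4], [2, 3, 4, 5])
print(metrics)
```

The example shows why LCCC complements the correlation coefficient: here r equals 1, yet the constant bias moves the points off the 1:1 line and pulls LCCC below 1.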

Descriptive Analysis of SOC and Environment Variables
The statistical characteristics of the measured soil organic carbon and the environmental variables are presented in Table 4. The measured soil organic carbon content shows a positively skewed distribution (skewness = 1.143), with a range of 0.120-43.125, a mean of 13.421, a median of 2.384, and a standard deviation of 7.785. Variables from the five types of data sources were included in the statistics, and the results showed that the selected remote sensing bands were all consistent with the band data distributions of radar data and spectral data.
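Summary statistics of this kind can be reproduced with the short sketch below. It assumes the sample standard deviation (n - 1 denominator) and the Fisher-Pearson moment coefficient of skewness; the estimators used for Table 4 are not stated, so these are assumptions, and the input values are invented for illustration.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "sd": statistics.stdev(values),   # sample standard deviation (n - 1)
        "skewness": m3 / m2 ** 1.5,       # Fisher-Pearson moment coefficient
    }

# Hypothetical SOC-like values with one large observation in the right tail.
summary = describe([1, 2, 3, 4, 10])
print(summary)
```

A positive skewness together with a mean above the median, as in this toy sample, indicates a long right tail: a few high values among many low ones.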

Evaluation and Comparison of Different Models
In this study, soil organic carbon was predicted from Sentinel-1A/2A/3A data, and SAR and optical data were combined at different resolutions to build seven models (Model A: Sentinel-1A data; Model B: Sentinel-2A data; Model C: Sentinel-3A data; Model D: DEM derivatives and climate data; Model E: Sentinel-1A, DEM derivatives and climate data; Model F: Sentinel-2A/3A, DEM derivatives and climate data; Model G: all data). SOC was predicted with two machine learning models, RF and Cubist, and Table 5 shows the predictive performance of the seven groups of models. The models gave different results at different spatial resolutions, and SOC predictions differed markedly among sensors and among combinations of sensors with DEM derivatives and climate data. The first three models show the spatial predictive capability of the Sentinel-1A, 2A, and 3A data, respectively, and Table 5 shows that, overall, RF is more accurate than Cubist. Across the four spatial resolutions, the best overall prediction is obtained with Sentinel-1A (Model A), followed by Sentinel-2A (Model B) and Sentinel-3A (Model C). For Sentinel-1A, accuracy was evaluated across the resolutions; at 10 m the RF model gave R² = 0.391, MAE = 0.123, RMSE = 6.438, and LCCC = 0.401, and the Cubist model gave R² = 0.335, MAE = 0.275, RMSE = 4.883, and LCCC = 0.304. The MAE ranges from 0.123 to 0.217, the maximum RMSE is 6.805 at 100 m resolution, and the LCCC shows the best fit at 10 m. For Sentinel-2A, the best prediction was achieved at 10 m resolution with the RF model (R² = 0.383, MAE = 0.372, RMSE = 6.766, LCCC = 0.324); R² gradually decreased as the pixel size increased, and the predictions at 100 and 300 m were similar. For Sentinel-3A, the best prediction was obtained at 500 m (R² = 0.373, MAE = 0.220, RMSE = 7.196, LCCC = 0.292). The prediction accuracy of the RF and Cubist models varied in the same way, and the Cubist model gave R² = 0.367, MAE = 0.145, RMSE = 7.582, and LCCC = 0.250, so the two predictions are similar. These results indicate that the Sentinel-1A data predict best at 10 m. In addition, DEM derivatives were combined with climate data to form Model D, for which RF was most accurate at 300 m (R² = 0.400) and Cubist at 500 m (R² = 0.311). The results show that each Sentinel sensor predicts SOC best at its own optimal resolution, so for single-source prediction the dominant resolution of each data source should be selected. The basic prediction performance of the data changes as the resolution changes, and each data source reaches high prediction accuracy at the resolution related to the data themselves.
S-1, DEM derivatives and climate data were combined to form Model E. The overall model accuracy was low; among the RF predictions at the four resolutions, accuracy was highest at 300 m (R² = 0.383, MAE = 0.385, RMSE = 7.975, LCCC = 0.267) and slightly lower at the other three resolutions. In the Cubist model, the highest prediction capability was at 10 m (R² = 0.339, MAE = 0.260, RMSE = 6.994, LCCC = 0.189), which is higher than at the other three resolutions. S-2, S-3, DEM derivatives and climate data were combined to form Model F. Both RF and Cubist have the highest accuracy at 100 m (RF: R² = 0.397, MAE = 0.361, RMSE = 7.598, LCCC = 0.171; Cubist: R² = 0.327, MAE = 0.300, RMSE = 6.494, LCCC = 0.210). When radar data (S-1) are combined with environmental variables (Model E), the RF prediction is affected by the environmental variables, while the Cubist prediction is not. When multispectral data (S-2, S-3) are combined with environmental variables (Model F), RF performs better than Cubist.
Overall, the predictive ability of the RF model for soil organic carbon was higher than that of the Cubist model. Among the seven models established for different data types and resolutions, the most effective is Model G, which combines SAR data, spectral data and all environmental variables. In this model, RF performs best at 10 m (R² = 0.406, MAE = 0.162, RMSE = 5.947, LCCC = 0.266). The overall trends of RF and Cubist were consistent, with some differences for data of different resolutions, which highlights the influence of the model on the predictive ability of different data. Different sensors have different sensitivities to soil organic carbon and also show significant differences at different spatial resolutions. Sensors with higher precision predict soil properties better at high spatial resolution, whereas lower-precision sensors predict less well.

Importance Analysis of Environmental Variables
In this paper, all the selected environmental variables were related to SOC, and their importance for SOC was ranked (Figure 2) for both the RF and Cubist models. In the RF model (Figure 2a), the three most important variables are Flow Accumulation, aspect, and twi, indicating that terrain is very important for prediction. The influence of the three sensors follows the order S-2 > S-3 > S-1. In terms of the proportion of importance (Figure 2b), DEM derivatives account for 54.96% of the total importance, followed by Sentinel-2A (19.85%), Sentinel-3A (14.26%), and then Sentinel-1A and climate (5.5% and 8.43%, respectively). In the Cubist model (Figure 2c,d), DEM derivatives accounted for 53.34%, Sentinel-3A for 17.02%, Sentinel-2A for 16.36%, and Sentinel-1A and climate for 8.36% and 4.92%, respectively. The most important DEM derivatives were SLOP, LS and chnl_alti. In both models, DEM derivatives account for a large proportion of the importance; the difference is that the proportion in Cubist is greater than that in RF. Among the Sentinel data sources, Sentinel-2A has the greatest impact in the RF model, whereas Sentinel-3A accounts for the largest proportion in Cubist, and the climate proportion is greater in RF than in Cubist. This shows that environmental variables account for a large proportion under both models and that different models favor different data sources, indicating that the choice of model also affects SOC prediction to a certain extent.
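Proportions by data source, like those quoted above, can be obtained by summing the per-variable importance scores within each group and normalizing to 100%. The sketch below shows that bookkeeping only; the function name, the variable names and the scores are hypothetical and are not the values behind Figure 2.

```python
def importance_shares(importances, groups):
    """Sum per-variable importances by data-source group, as percentages."""
    totals = {}
    for name, score in importances.items():
        group = groups[name]
        totals[group] = totals.get(group, 0.0) + score
    grand_total = sum(totals.values())
    return {g: 100 * t / grand_total for g, t in totals.items()}

# Hypothetical importance scores, invented for illustration only.
importances = {
    "flow_accumulation": 4.0, "aspect": 3.0, "twi": 3.0,
    "S2_B4": 4.0, "S3_Oa08": 3.0, "S1_VV": 1.0, "MAP": 2.0,
}
# Hypothetical mapping from each variable to its data source.
groups = {
    "flow_accumulation": "DEM derivatives", "aspect": "DEM derivatives",
    "twi": "DEM derivatives", "S2_B4": "Sentinel-2A",
    "S3_Oa08": "Sentinel-3A", "S1_VV": "Sentinel-1A", "MAP": "climate",
}
shares = importance_shares(importances, groups)
print(shares)
```

Because the shares are normalized within one model, they describe how that model distributes importance among sources and should sum to 100%; comparing raw scores between RF and Cubist would not be meaningful without such normalization.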

Spatial Prediction Results of SOC
The RF model has the highest prediction accuracy, and Figure 3 shows the SOC prediction maps of Model G (the seventh model) at the four spatial resolutions, with the lake masked in the images to help locate features. Across the four resolutions, the SOC distribution in the image changes as the resolution changes. At the selected locations, SOC generally tends to be low at 10 and 100 m resolutions, with the mean values of the two resolutions being close, at 14.967 and 14.295, respectively. The 100 m mean value of 12.485 is closer to the predicted mean value than the measured value. The standard deviations at 10 m and 100 m are 6.459 and 8.112. The mean value at 300 m is 12.944 with a standard deviation of 2.475, and the mean value at 500 m is 13.758 with a standard deviation of 3.787. As the pixel size increases, the image information gradually blurs and the prediction keeps changing.

Sentinel-1A/2A/3A for SOC Prediction
In this article, two machine learning methods, RF and Cubist, were chosen to predict SOC distribution, and the results of the RF model are shown in Figure 3. Prediction accuracy is described across the different spatial resolutions. Both models have been applied accurately and effectively to soil organic carbon, and their accuracy in SOC prediction mapping has been confirmed [87]. The overall accuracy of the RF model was superior to that of the Cubist model, which is consistent with the results of Pouladi [81], who mapped and compared soil organic matter predictions from five models (Cubist, Random Forest, Cubist-kriging, Random Forest-kriging, and Kriging) and reached the same conclusion as this paper. Akpa [82] predicted SOC variation in different soil layers and found that the RF model was slightly preferable to the Cubist model. The classification of SOC at different resolutions has also been studied in recent years [48]. However, SAR data have not been analyzed together with spectral data of two different resolutions. On this basis, SAR data are combined here with optical data as the data basis for soil organic carbon prediction, and the predictive power of different types of data for digital mapping of soil organic carbon is explored.
In predicting soil organic carbon with Sentinel data, Sentinel-1A (SAR data) has an advantage in soil prediction and performs well for most soil attributes [88]. The use of Sentinel-1A as a covariate for soil organic carbon is less studied, but high-accuracy SAR data are undeniably a promising data source and effectively avoid the influence of illumination on remote sensing prediction of SOC [45]. Multispectral remote sensing data are no longer limited to land use classification; they are also used to improve soil mapping accuracy as remote sensing monitoring accuracy continues to improve. Sentinel-2A data are well established in mapping and improve mapping accuracy through their spectral resolutions and better signal-to-noise ratio in the SWIR bands. S-2 has a clear advantage in quantitative mapping studies of SOC at different spatial scales, and a good match between the two can be achieved [89]. The Sentinel-3A OLCI image contains bare-soil spectral bands that support monitoring of surface bare soil, and although its spatial resolution is lower, its short revisit period and large land coverage can provide more comprehensive data for predicting SOC [90]. Combining Sentinel-2A and Sentinel-3A spectral data compensates for the deficiencies of each and extends prediction mapping across spatial resolutions from small to medium to large scales, thus enriching the remote sensing monitoring of soil by spectral data [48].
The accuracy obtained by the different models for the different data is shown in Table 5. The prediction accuracy of SAR data is superior to that of spectral data, and the two spectral data sources complement each other, each achieving its best prediction at a different spatial resolution. In the study of Kim [91], the effect of spatial resolution on remote sensing modeling is consistent with this study. When combined with DEM derivatives and other environmental variables, radar data perform better than spectral data, and the two kinds of data reach their best prediction accuracy at different resolutions. In this paper, the data are divided into four scales to analyze how the predictions vary with scale. Multi-scale analysis better reflects the direct influence of the selected data sources on the prediction results [48] and thus helps infer the optimal variables for SOC prediction at different regional scales.

Analysis of Environmental Variables
The importance of DEM derivatives for SOC is highlighted in the importance plot (Figure 2). The various topographic attributes have an obvious influence on SOC, to differing degrees [92], in agreement with the conclusions obtained in this paper; SOC was strongly dependent on the topography and showed a certain distribution [93]. Surface soils are influenced by surface materials such as vegetation cover and deposition, water distribution and biological migration, and these characteristics all have an impact on the surface SOC [93]. Among the topographic factors, elevation, slope, and aspect ranked high in importance, which is consistent with the conclusions obtained in this paper. Topography and slope play an important predictive role among the DEM derivatives, and topography influences the variation of land surface biomass and thus the variation of SOC storage [94]. S. A. Bangroo [95] proposed that topography has a direct influence on the distribution and spatial variation of SOC. Arun Mondal [96] concluded that SOC concentration decreases with increasing slope. Similarly, topographic indices such as TWI control the distribution of SOC in certain terrain, and TWI is an effective topographic factor that influences soil texture changes in terms of soil erosion and migration with runoff [97,98]. In this study, some other DEM derivatives are also found to influence SOC prediction, such as fa, hcurv, chnl_alti, ls, and ah; these indices also contribute to SOC prediction. The channel network can cause vegetation reduction, which affects surface cover change and thus soil change [99]. K. Adhikari [100] identified chnl_alti, together with wetness index, elevation, slope gradient, and slope-length factor, as important topographic factors influencing the distribution of SOC in Denmark.
Longitudinal curvature [101] provides a good interpretation of the soil profile and geological structure and relates surface material erosion to its spatial distribution. Longitudinal curvature and analytical hill shading proved to be the most important environmental variables among the many DEM derivatives [102]. Selecting effective environmental variables for SOC prediction can improve the prediction accuracy [103].
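As an illustration of how a topographic index such as TWI links flow accumulation and slope, the following minimal sketch uses the standard definition TWI = ln(a / tan β), where a is the specific catchment area and β is the local slope. This is not necessarily the variant computed in this study; the function name and the `min_slope` floor for flat cells are assumptions made for illustration.

```python
import math


def topographic_wetness_index(specific_catchment_area, slope_radians, min_slope=1e-3):
    """Standard TWI = ln(a / tan(beta)).

    specific_catchment_area: upslope contributing area per unit contour
        width (m^2 per m); must be positive.
    slope_radians: local slope angle in radians.
    min_slope: hypothetical floor (radians) so that flat cells do not
        cause a division by zero.
    """
    if specific_catchment_area <= 0:
        raise ValueError("specific catchment area must be positive")
    tan_beta = math.tan(max(slope_radians, min_slope))
    return math.log(specific_catchment_area / tan_beta)


# Same contributing area, different slopes: the gentler cell gets the
# higher index, i.e. it is expected to stay wetter.
gentle = topographic_wetness_index(500.0, math.radians(2.0))
steep = topographic_wetness_index(500.0, math.radians(25.0))
```

Cells with a large contributing area and a gentle slope receive high values, which matches the role of TWI described above as a control on runoff-driven redistribution of soil material.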

Comparison of Spatial Prediction Models
In this study, digital soil mapping was used to obtain a soil organic carbon map for the Ebinur Lake basin in the arid zone. Soil organic carbon mapping has not previously been carried out for this study area, although soil carbon storage and distribution characteristics have been examined in previous studies [104]. The main feature of the spatial distribution of SOC is the higher content in the oasis region, a phenomenon in which the influence of human activities on SOC cannot be excluded [105], while vegetation growth and withering processes and animal activities also contribute [106]. The SOC content was lower in relatively wet and relatively dry environments, that is, around water bodies and in desert areas, which is consistent with the results of [107]. SOC content varies with altitude: the higher the altitude, the lower the SOC content, indicating a trend of SOC transport to lower altitudes [108,109]. Changes in temperature and precipitation due to altitude also have a significant effect on the distribution of SOC [16], and urbanization within oasis regions leads to higher temperatures [110]; the areas with higher SOC in the figure are several major urban areas, while SOC content decreases in areas with higher temperatures [111]. In contrast, the areas with more precipitation in the region are mostly higher-altitude mountainous areas, where an unstable surface parent layer disperses SOC and can reduce its content [14]. The study area is located in an arid zone, where prediction of the SOC distribution plays an integral role in carbon stock estimation under the influence of climate change, human activities and topography [112–114].

Conclusions
In this study, the distribution of soil organic carbon in the arid zone was analyzed using three data sources (S-1, S-2, and S-3) from two sensor types (radar and optical) at four spatial resolutions (10, 100, 300, and 500 m), with two machine learning algorithms (RF, Cubist) used to predict the SOC in the study area. The following conclusions were obtained:
(1) The simulation accuracies of the three data sources are ranked as Sentinel-1A (Model A) > Sentinel-2A (Model B) > Sentinel-3A (Model C). Across the spatial resolutions, Sentinel-1A and Sentinel-2A perform better at 10 m, and Sentinel-3A performs best at 500 m.
(2) Combining all environmental variables, the best model is Model G, which combines radar data, optical data and all environmental variables. In this model, the RF method has the best modeling effect at 10 m, with R² = 0.406, MAE = 0.162, RMSE = 5.947, and LCCC = 0.266. In Model E, which combines SAR data with environmental variables, the prediction is best at 300 m, reaching R² = 0.383. In Model F, which combines spectral data (S-2, S-3) with environmental variables, the prediction is best at 100 m, reaching R² = 0.397.
(3) Overall, the accuracy of the RF model is better than that of the Cubist model, and the RF model can be used to predict SOC in arid areas in the future.
(4) The spatial distribution of SOC shows that the SOC content is higher in oases and lower in mountainous areas and areas around the lake.