Modeling Soil Moisture from Multisource Data by Stepwise Multilinear Regression: An Application to the Chinese Loess Plateau

This study aims to integrate multisource data to model the relative soil moisture (RSM) over the Chinese Loess Plateau in 2017 by stepwise multilinear regression (SMLR) in order to improve the spatial coverage of our previously published RSM. First, 34 candidate variables (12 quantitative and 22 dummy variables) from the Moderate Resolution Imaging Spectroradiometer (MODIS) and topographic, soil properties, and meteorological data were preprocessed. Then, SMLR was applied to variables without multicollinearity to select statistically significant (p-value < 0.05) variables. After the accuracy assessment, monthly, seasonal, and annual spatial patterns of RSM were mapped at 500 m resolution and evaluated. The results indicate that there was a high potential of SMLR to model RSM with the desired accuracy (best fit of the model with Pearson’s r = 0.969, root mean square error = 0.761%, and mean absolute error = 0.576%) over the Chinese Loess Plateau. The variables of elevation (0–500 m and 2000–2500 m), precipitation, soil texture of loam, and nighttime land surface temperature can continuously be used in the regression models for all seasons. Including dummy variables improved the model fit both in calibration and validation. Moreover, the SMLR-modeled RSM achieved better spatial coverage than that of the reference RSM for almost all periods. This is a significant finding as the SMLR method supports the use of multisource data to complement and/or replace coarse resolution satellite imagery in the estimation of RSM.


Introduction
Soil moisture (SM) is widely recognized as a vital land surface variable that associates with land-atmosphere interaction [1,2], rainfall-runoff processes [3], water-energy balance [4], and climate change [5]. Accurate characterization of SM is beneficial for applications such as weather and climate modeling, agricultural and water resources management over larger spatial extents [6]. Accurate and timely SM estimation at a relevant spatiotemporal scale is a sound strategy to forecast droughts/floods [7,8] and schedule irrigation [9,10] for the sustainability and productivity of agriculture, particularly in arid and semi-arid regions like the Chinese Loess Plateau (CLP).
Various remote sensing data, spanning almost all regions of the electromagnetic spectrum from bands of microwave to visible, have been utilized for SM retrieval since

Study Area
This study area is the CLP-Chinese Loess Plateau (100 • 54 -114 • 33 E and 33 • 43 -41 • 16 N), northwestern China, which covers an area of approximately 640,000 km 2 , spanning seven provinces (Figure 1a). The landscape is strongly shaped by wind-water erosion and has a highly fractured landform of gullies [74]. Both the mean annual temperature and precipitation gradually decrease from the southeast (14 • C and 750 mm) to the northwest (4 • C and 200 mm) [75]. The CLP has been categorized as the most seriously eroded landscape in the world because of its loose and erodible soil [76,77]. The already low and concentrated precipitation (the mean annual precipitation is 420 mm and 55-78% fall in the wet season from July to September) makes the CLP particularly vulnerable to drought [78]. In addition, SM over the CLP shows significant spatial variation due to climatic characteristics and fragmented topography [36,79]. Therefore, quantitative estimations of the SM over the CLP are more important than for other regions.
In the present study, to ensure uniform distribution of calibration and validation samples (without overlapping), both the 7814 calibration samples (small red points in Figure 1b) and the 7824 validation samples (small blue points in Figure 1b) were selected at 10 km intervals while the distance between the adjacent calibration and validation samples was 5 km. A total of 298 Chinese automatic meteorological stations (large red points in Figure 1b) provided hourly precipitation and relative air humidity observation data over the CLP.

Soil Moisture Data
The relative soil moisture (RSM) represents the percentage of SM that accounts for the moisture storage capacity and was used to describe the SM levels in the present study. The monthly, seasonal, and annual RSM maps of the CLP in 2017 (previously published [32]) were used and regarded as the reference RSM for modeling in this study. The previously published RSM was generated via 8-day RSM maps. The overall 8-day RSM was combined at a 500 m resolution by corresponding subregional RSM, which was produced with three groups of selected optimal NDVI thresholds using MODIS-derived ATI (apparent thermal inertia) and TVDI (temperature vegetation dryness index), and the average of ATI and TVDI against 20 cm depth in situ RSM observations [32]. Here, many studies pointed out TVDI and ATI could adequately reflect the changes of RSM at a 20 cm depth [11,[80][81][82]. In terms of RSM estimation using the ATI-based model, soil thermal inertia (TI) is described as a thermal property of soil that characterizes its resistance to temperature change and has been used for near-surface SM retrieval [11,83]. ATI, to simplify the TI, is calculated by spectral surface albedo and the diurnal land surface temperature (LST) range [31,84]. The ATI method has been used for monitoring RSM for bare soil or sparsely vegetated regions. In addition, as an effective method based on the NDVI-LST feature space, the TVDI-based model considers vegetation coverage in RSM estimation and has been widely applied to vegetated areas. To estimate 8-day subregional RSM, the overall CLP was divided into three subregions (the ATI subregion, the TVDI subregion, and the ATI/TVDI subregion) according to the NDVI of individual pixels. The ATI-based model, the TVDI-based model, and the ATI/TVDI joint model were used in the ATI subregion, the TVDI subregion, and the ATI/TVDI subregion, respectively, and corresponding subregional RSM data were obtained. Therefore, the equations were used as follows:

Soil Moisture Data
The relative soil moisture (RSM) represents the percentage of SM that accounts for the moisture storage capacity and was used to describe the SM levels in the present study. The monthly, seasonal, and annual RSM maps of the CLP in 2017 (previously published [32]) were used and regarded as the reference RSM for modeling in this study. The previously published RSM was generated via 8-day RSM maps. The overall 8-day RSM was combined at a 500 m resolution by corresponding subregional RSM, which was produced with three groups of selected optimal NDVI thresholds using MODIS-derived ATI (apparent thermal inertia) and TVDI (temperature vegetation dryness index), and the average of ATI and TVDI against 20 cm depth in situ RSM observations [32]. Here, many studies pointed out TVDI and ATI could adequately reflect the changes of RSM at a 20 cm depth [11,[80][81][82]. In terms of RSM estimation using the ATI-based model, soil thermal inertia (TI) is described as a thermal property of soil that characterizes its resistance to temperature change and has been used for near-surface SM retrieval [11,83]. ATI, to simplify the TI, is calculated by spectral surface albedo and the diurnal land surface temperature (LST) range [31,84]. The ATI method has been used for monitoring RSM for bare soil or sparsely vegetated regions. In addition, as an effective method based on the NDVI-LST feature space, the TVDI-based model considers vegetation coverage in RSM estimation and has been widely applied to vegetated areas. To estimate 8-day subregional RSM, the overall CLP was divided into three subregions (the ATI subregion, the TVDI subregion, and the ATI/TVDI subregion) according to the NDVI of individual pixels. The ATI-based model, the TVDI-based model, and the ATI/TVDI joint model were used in the ATI subregion, the TVDI subregion, and the ATI/TVDI subregion, respectively, and corresponding subregional RSM data were obtained. Therefore, the equations were used as follows: where RSM overall represents the overall RSM and it is combined by three subregional RSM (RSM ATI , RSM ATI/TVDI , and RSM TVDI ). RSM ATI and RSM TVDI are the RSM estimated by the ATI-based and TVDI-based models, respectively, and RSM ATI/TVDI is the RSM estimated by the ATI/TVDI joint model. a ATI and b ATI are coefficients from fitting the ATI values and in situ RSM observations in the ATI subregion. a TVDI and b TVDI are coefficients from fitting the TVDI values and in situ RSM observations in the TVDI subregion. a ATI/TVDI and b ATI/TVDI are coefficients from fitting the average value of ATI and TVDI and in situ RSM observations in the ATI/TVDI subregion. NDV I ATI and NDV I TVDI are the selected optimal thresholds for generating subregions. Three optimal NDVI thresholds (NDVI 0 was used for computing TVDI, and both NDVI ATI and NDVI TVDI for dividing the entire CLP) were identified with the best validation results of subregions for 8-day periods. To assess the performance of the models over the CLP, the Pearson's r and the mean absolute error (MAE) of the reference RSM against in situ RSM observations were calculated (at the monthly, seasonal, and annual scales). The r and MAE varied from 0.47 in October to 0.68 in January and from 3.02% in March to 4.97% in October, respectively, on the monthly scale (Table A1) [32]. Since the r of reference RSM and in situ RSM observations had a moderate correlation coefficient (r = 0.73 for the annual scale) [85], the reference RSM was assumed to be the actual RSM with larger spatial coverage than those in situ RSM observations and the reference RSM kept a better trend with the in situ observed RSM at the station scale. The area of the reference RSM within five months (April, May, July, August, and October) was less than half of the entire study area (~32 × 10 4 km 2 ) (Table A1). Thus, the SMLR method was applied to improve the spatial coverage of our previously published RSM based on the reference RSM and multisource data.

Candidate Variables
A total of 17 features (12 quantitative and five categorical variables), including MODISderived features, topographical features, soil properties, and meteorological features, were selected as candidate variables for RSM modeling using SMLR. These candidate variables in our study were applied to estimate RSM in previous studies [18,34,39,40]. To be more specific, the 12 quantitative variables included daytime LST (DL) [18,40], nighttime LST (NL), diurnal differences in LST (DIL), evapotranspiration (ET) [18,38,39], enhanced vegetation index (EVI), difference vegetation index (DVI), ratio vegetation index (RVI), normalized difference vegetation index (NDVI) [34], enhanced vegetation index 2 (EVI2), modified soil-adjusted vegetation index (MSAVI), precipitation (PRE) [38,39], and relative humidity (RH) [86]. Moreover, the categorical variables were land cover (LC) [36], elevation (DEM) [18,35], slope gradient (SG), slope aspect (SA), and soil texture (ST) [18,34,37,40], which were declared as dummy variables in the regression equations. The variables used in this study are described in Table 1 (see Table A2 for the calculation of the vegetation indexes). A dummy variable is an artificial variable that is created to represent an attribute with two or more distinct levels/categories [87,88]. To avoid the dummy variable trap (a scenario in which independent variables are collinear), one less dummy variable (n-1) than the categorical values (n) was used [89]. Thus, dummy variables, including elevation (seven categories and six dummy variables), land cover (three categories and two dummy variables), slope aspect (four categories and three dummy variables), slope gradient (six categories and five dummy variables), and soil texture (seven categories and six dummy variables) were used in this study. The spatial pattern of the annual reference RSM and 17 features are shown in Figure 2. Detailed pre-processing from features to candidate variables is given in the later section.

MODIS Data
The LST and vegetation indexes are two important parameters closely related to RSM and are often used on RSM estimations [90]. The 8-day 1 km composite LST product (MOD11A2), the 8-day 500 m evapotranspiration (ET) product (MOD16A2), the 8-day 500 m surface reflectance product (MOD09A1), and the annual 500 m land cover product (MCD12Q1. Type2) of 2017 were used to develop the variables of DL, NL, ET, LC, and six vegetation indexes. The utilized MODIS data were downloaded from the website of the level-1 and Atmosphere Archive and Distribution System (LAADS) Distributed Archive Center (DAAC) [91]. In order to facilitate the utilization of MODIS products, all pixels were processed for quality assurance based on the valid range before further computing. The LST products, including DL and NL, were resampled to a reso-

MODIS Data
The LST and vegetation indexes are two important parameters closely related to RSM and are often used on RSM estimations [90]. The 8-day 1 km composite LST product (MOD11A2), the 8-day 500 m evapotranspiration (ET) product (MOD16A2), the 8-day 500 m surface reflectance product (MOD09A1), and the annual 500 m land cover product (MCD12Q1. Type2) of 2017 were used to develop the variables of DL, NL, ET, LC, and six vegetation indexes. The utilized MODIS data were downloaded from the website of the level-1 and Atmosphere Archive and Distribution System (LAADS) Distributed Archive Center (DAAC) [91]. In order to facilitate the utilization of MODIS products, all pixels were processed for quality assurance based on the valid range before further computing. The LST products, including DL and NL, were resampled to a resolution of 500 m (reference RSM resolution) and were chosen to calculate the diurnal differences in LST (DIL = DL − NL) [18,40]. Pixels were masked when the DL was less than 0 • C, suggesting snow coverage or frozen soil at that time and location. Strong positive relationships between RSM and the vegetation index (e.g., NDVI) were found for Australia [92], tallgrass prairies in the USA [93,94], croplands in North China [95], and East Africa [96]. Surface reflectance products (including seven bands) were chosen to calculate the five vegetation indices: EVI, DVI, RVI, NDVI, EVI2, and MSAVI [34]. Monthly, seasonal, and annual data applied for modeling were combined by these 8-day composite products.
The influences of land covers on RSM are complex [36,37]. According to the MCD12Q1. Type2 from the University of Maryland (UMD) land cover classification scheme (with 16 different cover types) [97], the land cover data were reclassified here into three categories, namely croplands, forest, and shrublands, as well as other land covers in the study area ( Table 2). With regard to the variable of land cover (a categorical variable), with other land covers as reference (row in the orange color background in Table 2), the remaining two land cover types (LC1 and LC2, rows in the green color background in Table 2) were integrated as dummy variables in the modeling procedure (with only two values; 0 and 1) [98,99].

Topographic Data
Digital elevation model (DEM) data were used because the distribution of the RSM and other features are directly related to elevations [18,100]. The 90 m spatial resolution Shuttle Radar Topography Mission (SRTM) DEM, SG, and SA datasets for the study area were downloaded from the Geospatial Data Cloud (GDC) Platform (http://www.gscloud.cn/, accessed on 11 October 2020) and were then resampled to a 500 m spatial resolution. These three chosen features were used as dummy variables in this study. DEM was classified into seven categories at 500 m intervals. With elevations exceeding 3000 m as the reference, the remaining six elevation categories were integrated as dummy variables (DEM1, DEM2, DEM3, DEM4, DEM5, and DEM6) in the modeling procedure (Table 3). A total of 52% of the CLP was in the elevation categories from 1000 to 1500 m. Slope gradients, ranging from 0 • to 64.85 • over the CLP (Figure 2n), were classified into six categories at 5 • intervals. With SG exceeding 25 • as the reference, the remaining five SG categories were also integrated as dummy variables (SG1, SG2, SG3, SG4, and SG5) in the modeling procedure (Table 3). Table 3. Elevation and slope gradient classification for the utilized dummy variables (green color rows represent dummy variables of DEM1, DEM2, DEM3, DEM4, DEM5, DEM6, SG1, SG2, SG3, SG4, and SG5, and reference elevation (3000 m and above) and reference slope gradient (25 • and above) are the rows with the orange color background). The slope aspect (SA) played an important role in the distribution of RSM at the hillslope domain and should be considered when attempting to characterize RSM variability in gullied regions [101]. Here, SA was divided into four categories (semi-shady, semi-sunny, shady, and sunny, as shown in Figure A1). Using the sunny category as the reference, the remaining three categories were integrated as dummy variables (SA1, SA2, and SA3) in the modeling procedure (Table 4). Table 4. Slope aspect classification of dummy variables (green color rows represent dummy variables of SA1, SA2, and SA3 and the reference slope aspect of sunny is the row with the orange color background).

Class
Range ( [102,103]. Maps of the sand, clay, and silt content for the study area were provided by the Data Center for Resource and Environmental Sciences, Chinese Academy of Sciences (RESDC, http://www.resdc.cn/, accessed on 20 August 2020) (Figure 3). High percentage sand content was clustered in the northwestern region of the CLP, and low percentage sand content and high percentage silt content covered the southern CLP. Soil texture was classified into 12 types using the U.S. Department of Agriculture (USDA) Soil Texture Classification System ( Figure A2 and Table 5). A total of seven textural classes were identified over the CLP and more than half of the CLP (51.493%) was covered by sandy loam. The textural classes were integrated as dummy variables (ST1, ST2, ST3, ST4, ST5, and ST6) using the class of sandy loam as the reference group in the modeling procedure (Table 5). Soil texture was classified into 12 types using the U.S. Department of Agriculture (USDA) Soil Texture Classification System ( Figure A2 and Table 5). A total of seven textural classes were identified over the CLP and more than half of the CLP (51.493%) was covered by sandy loam. The textural classes were integrated as dummy variables (ST1, ST2, ST3, ST4, ST5, and ST6) using the class of sandy loam as the reference group in the modeling procedure (Table 5). Table 5. Soil texture classification for dummy variables (green color rows represent dummy variables of ST1, ST2, ST3, ST4, ST5, and ST6, and reference soil texture of sandy loam is the row with the orange color background).

Meteorological Data
Precipitation (PRE), both the amount and intensity, has been shown to be a major driver of SM dynamics [104][105][106]. PRE and relative humidity (RH) were also applied to RSM estimation [86]. The network of in situ automatic meteorological stations of the Chinese Meteorological Data Service Center (CMDC) provided the meteorological inputs required by the model: RH and PRE (both acting as quantitative variables). Hourly RH (%) and PRE (mm) were recorded at 298 automatic meteorological stations ( Figure 1) over the CLP in 2017 [107]. Concerning RH, for an accurate temporal match between in situ observation data and monthly reference RSM (produced by averaging corresponding 8-day RSM), the daily granule acquisition time of MOD09GA products was obtained to serve as the reference for selecting corresponding in situ RH observations. The monthly in situ RH value at each meteorological station was computed by averaging the daily RH. Then, the spatially interpolated inverse distance weighted (IDW) method was applied to express the RH spatial patterns at a resolution of 500 m [108]. The same procedure was performed for seasonal and annual RH maps ( Figure 2c). The mean annual RH over the CLP in 2017 ranged from 34.15% to 66.88%.
Regarding PRE, daily PRE with cumulative values were also rescaled to monthly, seasonal, and annual temporal resolutions. After interpolation by inverse distance weighting, monthly, seasonal, and annual PRE maps were generated [18,109]. The maximum and minimum total annual PRE in 2017 was 1020.15 mm and 99.80 mm, respectively ( Figure 2b). Both PRE and RH appeared to be gradually decreasing from southeast to northwest and the year 2017 was a typical year from a climate perspective [110,111].

Stepwise Multilinear Regression Modeling
The calibration samples for each period were used to constructing stepwise multilinear regression (SMLR) models first. As we presented in Figure 1, 7814 calibration samples and 7824 validation samples were selected at 10 km intervals while the distance between the adjacent calibration and validation samples was 5 km. Table 6 displays the different used samples for calibration and validation due to different RSM areas for each period. The next step is to determine the actual set of variables used from 34 candidate variables (12 quantitative variables and 22 dummy variables) in the final regression. SMLR is routinely used for finding important variables while multicollinearity among variables often undermines its performance [98,112]. To select variables without multicollinearity, the general MLR model was first applied using calibration samples with 34 independent variables for each period (monthly, seasonal, and annual data) ( Figure 4). The MLR model involves a series of single factor correction coefficients and, in this study, does not consider interactions (excluding transformed variables) among variables. An MLR can be represented as: where γ represents the dependent variable (reference RSM), and β 0 and β i represent the constant offset and regression coefficients of the corresponding explanatory variables X i , respectively. The deviation between model outputs and reference RSM represents the model bias ε. Since collinearity likely exists among 34 candidate variables, the Variation Inflation Factor (VIF) [98] was used to examine it: where R i represents the correlation coefficient between the i th predictive variable and the remaining predictive variables. No multicollinearity exists if VIF is less than 3 [109,113].
If VIF exceeded 3 (which indicates multicollinearity), the variable with the highest VIF was removed and the model was re-evaluated. Then, the candidate variables with a VIF of less than 3 were prepared for SMLR modeling. In an SMLR analysis, the most significant or least significant variable is iteratively added to or removed from the MLR model based on its statistical significance [98]. Statistically significant variables were identified through continual regression iterations in the linear regression equation (p < 0.05 was applied in this study). This method can effectively select powerful features for the construction of a good predictive model and has been widely used in different fields [89,98,99,114], including SM estimation [34,62]. In addition, although there are many different strategies for selecting variables for a regression model, all possible regression procedures should be used if there are no more than fifteen candidate variables and the SMLR could be used for more than fifteen candidate variables [115]. As such, the SMLR was considered more suitable for constructing RSM models in this study. Since collinearity likely exists among 34 candidate variables, the Variation Inflation Factor (VIF) [98] was used to examine it: where represents the correlation coefficient between the th predictive variable and the remaining predictive variables. No multicollinearity exists if VIF is less than 3 [109,113].
If VIF exceeded 3 (which indicates multicollinearity), the variable with the highest VIF was removed and the model was re-evaluated. Then, the candidate variables with a VIF of less than 3 were prepared for SMLR modeling. In an SMLR analysis, the most significant or least significant variable is iteratively added to or removed from the MLR model based on its statistical significance [98]. Statistically significant variables were identified through continual regression iterations in the linear regression equation (p < 0.05 was applied in this study). This method can effectively select powerful features for the construction of a good predictive model and has been widely used in different fields [89,98,99,114], including SM estimation [34,62]. In addition, although there are many different strategies for selecting variables for a regression model, all possible regression procedures should be used if there are no more than fifteen candidate variables and the SMLR could be used for more than fifteen candidate variables [115]. As such, the SMLR was considered more suitable for constructing RSM models in this study.

Accuracy Assessment
The validation samples (presented in Figure 1b) for each period were used to assess the accuracy of the constructed SMLR models. Statistical metrics, including root mean square error (RMSE), Pearson's r (r), Adjusted R 2 (Adj. R 2 ), MAE, and standard deviation (STD) were calculated to evaluate the performance of the simulation [11,116]. The agreement and degree of dispersion between the modeled RSM via SMLR and reference RSM were analyzed here in terms of these five classical statistical criteria for each period. The equations are presented in Table A3. The modeled RSM data, which used multisource data via the SMLR method at the monthly, seasonal, and annual scales, were evaluated and compared with the reference RSM.

Stepwise Multilinear Regression Model
The SMLR models were established using selected variables to simulate RSM. The results of performing SMLR for constructing the RSM models are presented in Table 7. For the SMLR models, the model fit was generally assessed by its Adj. R 2 . In this case, the models that were developed to estimate RSM fitted best with the highest Adj. R 2 of 0.912 (i.e., 91.2% of the variation of the dependent variable RSM can be explained by the change in the independent variables) and the lowest RMSE of 0.798% in winter among all four seasons in 2017. Moreover, the best model fitting results were obtained in December (Adj. R 2 = 0.938 and RMSE = 0.753%) among the months (Table A4). The linear regression equation had the lowest Adj. R 2 of 0.091 and highest RMSE of 4.182% in October. Focusing on intercepts of the models, Table 7 shows that there are both negative (in winter and autumn) and positive (in spring and summer) intercepts, and the annual regression model is identified with positive intercepts (125.231). Clearly, the selected variables and the number of selected variables for each regression model varied with the assessed periods. The directions of regression coefficients for variables in the SMLR models (Table 7) indicate the positive or negative correlations between RSM and the variables. The selected variables with directions of regression coefficients in the models at monthly, seasonal, and annual scales are shown in Figures 5 and A3. In total, 12 quantitative variables and five categorical variables (represented by 22 dummy variables) were used in this study. Among these 34 candidate variables, 27 variables were selected for monthly models.
Interestingly, the directions of the regression coefficients for an individual variable were not fixed; for example, the variable of PRE (labeled with blue stars in Figure A3). A positive correlation was found with the RSM in March, June, August, and November, but negative directions of regression coefficients were found in July, September, October, and December. Moreover, the selected variables of ST1, ST3, and DIL also kept the same direction in the regression models for the periods.
The number of selected variables for modeling at seasonal and annual scales (23 selected variables) was slightly lower than that at the monthly scale (27 selected variables). The variables of PRE with positive regression coefficients (except in winter), DEM1 (elevation of 0-500 m with positive regression coefficients) were used to simulate RSM in the seasonal and annual regression models. Focusing on the annual regression model, 17 selected variables of DVI, PRE, DEM1, DEM2, DEM4, DEM5, DEM6, LC1, LC2, SG3, SG4, ST1, and ST3 were positively correlated and DL, NL, ST4, ST5, were negatively correlated and were applied to simulate RSM. Among the five categorical variables (elevation, slope gradient, slope aspect, land cover, and soil texture), four categorical variables were used (except for slope aspect) in the annual and seasonal linear equations.
Interestingly, the directions of the regression coefficients for an individual variable were not fixed; for example, the variable of PRE (labeled with blue stars in Figure A3). A positive correlation was found with the RSM in March, June, August, and November, but negative directions of regression coefficients were found in July, September, October, and December. Moreover, the selected variables of ST1, ST3, and DIL also kept the same direction in the regression models for the periods. , and ST3 were positively correlated and DL, NL, ST4, ST5, were negatively correlated and were applied to simulate RSM. Among the five categorical variables (elevation, slope gradient, slope aspect, land cover, and soil texture), four categorical variables were used (except for slope aspect) in the annual and seasonal linear equations.
To identify the contribution of dummy variables to the SMLR models, SMLR was conducted without dummy variables at the seasonal and annual scales and Table 8 shows the comparison results. As expected, including dummy variables improved the model fit in the calibration as well as in the validation because almost all periods (except for summer) of the Adj. R 2 with dummy variables were higher than those without dummy variables. This might be contributed to the low accuracy of reference RSM at that time. It is also possible that other selected numerical variables, such as precipitation, have significant effects on RSM changes, while dummy variables appear to have relatively weak influence as a whole. Directions of regression coefficients for selected variables To identify the contribution of dummy variables to the SMLR models, SMLR was conducted without dummy variables at the seasonal and annual scales and Table 8 shows the comparison results. As expected, including dummy variables improved the model fit in the calibration as well as in the validation because almost all periods (except for summer) of the Adj. R 2 with dummy variables were higher than those without dummy variables. This might be contributed to the low accuracy of reference RSM at that time. It is also possible that other selected numerical variables, such as precipitation, have significant effects on RSM changes, while dummy variables appear to have relatively weak influence as a whole. Table 8. Comparison of the accuracy of the stepwise multilinear regression (SMLR) models with and without dummy variables in calibration and validation at the seasonal and annual scales (the results of models with dummy variables that performed better are formatted in bold).  Figure 6 illustrates the results of the assessment of RSM models at seasonal and annual time scales. The highest r of 0.955 and Adj. R 2 of 0.912 and the lowest STD of 0.770%, RMSE of 0.809%, and MAE of 0.622% were found in winter (Figure 6a). The annual validation result was characterized by a relatively high correlation coefficient (r > 0.75) and low error (RMSE = 1.725%), on the considered subset of 7493 samples (Figure 6e). 0.75) and low error (RMSE = 1.725%), on the considered subset of 7493 samples ( Figure  6e).

Accuracy Assessment
The model validation with the available sampled dataset at the monthly scale is presented in Figure A4. It should be noted that there were six months (December, January, February, April, May, and November) with Adj. R 2 exceeding 0.800 in 2017 and relatively low errors (RMSE) of the models during the winter season (December, January, and February), ranging from 0.761% to 1.050%. Importantly, all intercepts and slopes of the linear fits to the modeled RSM were between 0 and 1, which indicates an overestimation of the model in the lower RSM region and underestimation of the model in the higher RSM region.   The model validation with the available sampled dataset at the monthly scale is presented in Figure A4. It should be noted that there were six months (December, January, February, April, May, and November) with Adj. R 2 exceeding 0.800 in 2017 and relatively low errors (RMSE) of the models during the winter season (December, January, and February), ranging from 0.761% to 1.050%. Importantly, all intercepts and slopes of the linear fits to the modeled RSM were between 0 and 1, which indicates an overestimation of the model in the lower RSM region and underestimation of the model in the higher RSM region. Figure 7 depicts the spatial patterns of the seasonal and annual RSM over the CLP. Colors varying from red to blue indicate the change of RSM from lowest (i.e., driest soil) to highest (i.e., wettest soil). Overall, the spatiotemporal distribution of the modeled RSM via the SMLR model had the same pattern as the reference RSM. RSM decreased gradually from southeast to northwest and indicated an overall spatial distribution pattern of "wet in south and southeast, dry in northwest". Such variation was also displayed in the annual PRE, RH, NL, and ET maps (Figure 2). The areas with low RSM (i.e., below 10%) were mainly clustered in the northwestern region of the CLP. These areas have a temperate continental climate, with little rainfall, low RH, strong daylight, a high proportion of sand content, and low ET because of the low vegetation coverage. RSM was higher in the western region of the CLP (with the highest elevation) in spring and summer. The mean modeled RSM was highest (13.899%) in autumn and lowest (10.276%) in winter (Figure 7).

Modeled Soil Moisture
The annual mean modeled RSM (12.158%) (Figure 7e), modeled via the SMLR model, was higher than that of the reference RSM (10.160%).
The monthly RSM maps from the multisource data modeled via the SMLR models are shown in Figure A5. Compared with the monthly reference RSM, a significant improvement was achieved in the spatial coverage of the modeled RSM via the SMLR model. The modeled RSM area of 10 months (except for January and February) was larger than that of the reference RSM in 2017. To be more specific, the area of modeled RSM increased more than two-fold over five months (April: 142.082%, May: 113.209%, July: 149.955%, August: 115.120%, and October: 243.008%) and the increased area of the modeled RSM ranged from 1.020 × 10 4 km 2 in December to 44.519 × 10 4 km 2 in October.  The monthly RSM maps from the multisource data modeled via the SMLR models are shown in Figure A5. Compared with the monthly reference RSM, a significant improvement was achieved in the spatial coverage of the modeled RSM via the SMLR model. The modeled RSM area of 10 months (except for January and February) was larger than that of the reference RSM in 2017. To be more specific, the area of modeled RSM increased more than two-fold over five months (April: 142.082%, May: 113.209%, July: 149.955%, August: 115.120%, and October: 243.008%) and the increased area of the modeled RSM ranged from 1.020 × 10 4 km 2 in December to 44.519 × 10 4 km 2 in October.

Discussion
According to the results of the model fit, the highest Adj. R 2 of 0.912, both in calibration and in validation, was found in winter. Accordingly, regression models in December, January, and February demonstrated much better performances (Adj. R 2 ranged from 0.853 to 0.939 in validation) compared with other months. These results were better than those that were obtained by Lee et al. [57] who used MLR for South Korea (where the R 2 ranged from 0.17 to 0.63). The linear regression equation had the lowest Adj. R 2 of 0.091 in October (Adj. R 2 of 0.073 in validation), which was mainly the result of the poor quality of the reference RSM with the lowest r of 0.47 against in situ RSM observations among months [32]. Similarly, the validation results for autumn (winter) were considerably lower (higher) than the others. The modeled results using SMLR were highly related to the quality of the input data, especially the reference RSM we previously published. The reference RSM in autumn had the greatest error and had the highest Pearson's r in winter among the four seasons. This may be considered the reason for the similar validation results in the current study. To explore the reason or accurately estimate SM, a better model can be developed through future works. Thus, the performance of the SMLR model was found to be sensitive to the quality of the reference data.
For individual variables, the directions of regression coefficients in the regression equations varied between periods. Theoretically, the variables of precipitation and RSM should maintain a positive correlation when only precipitation is considered as an independent variable. However, in an arid area, RSM often reaches its highest value after a heavier rain; if several small rain events occur instead, RSM at a 20 cm depth does not increase and even keeps decreasing, as it is affected by other variables (e.g., temperature) in certain land covers [37]. In addition, many variables were selected in the regression models that weakened the effects of the variation of precipitation on RSM (even having a negative impact) compared with variables with strong positive influence. The interpretation of each variable in the models is, however, complex, and specific relationships between variables and RSM were not examined in this study but their influence would deserve further study. From the modeling, it is worth noting that variables such as ST1, ST3, and DIL, retained the same directions in the regression models throughout the period. This indicates that these variables demonstrate a stable correlation with RSM and maintain their impact even when other variables change [70].
Dummy variables like LC1 (11 months except for August) and ST5 (9 months) had high frequencies of application in the regression models among months. This suggests that after the effects of other variables were considered, these dummy variables could be higher or lower (indicating a positive or negative effect, respectively, according to the directions of the corresponding coefficient in the models) than the dependent variable of RSM. In particular, the land cover of croplands scored 0.632 points higher, at the annual regression model (Table 7) on the modeling RSM, than the reference groups (i.e., other land covers). In general, quantitative or categorical variables in regression models often had interaction effects with each other [89]. In the present study, the regression models did not allow for the possibility that interaction might occur among variables (no interaction term exists in the regression models). Thus, the regression coefficient for each variable could be individually interpreted as a statistically significant (p-value < 0.05) dependent variable (RSM). Several researchers reported that the number and position of the dummy variables affect the fitting degree and estimation accuracy of the resulting models [68,117,118]. Chen et al. found that the dummy variable model did not differ in regional biomass estimation ability [119].
The overall spatial pattern of the modeled RSM (with high RSM in the southern regions and low RSM in the northwestern areas of the CLP throughout the period) via SMLR was in good agreement with previous studies [11,32,120]. As for the mean RSM, the average of the mean modeled RSM of 12 months was 13.114%, which slightly exceeded that of the reference RSM (12.155%) [32]. Regarding the overestimation of the dry area and the underestimation of the wet area, this might be partly attributed to the overestimation or underestimation of certain selected variables. The present study did not consider such parameter uncertainty, which was considered as a limitation of this study. In addition, the area of modeled RSM in January and February was smaller (larger for other periods) than that of the reference RSM. Each of the selected variables in the regression models (i.e., the value and distribution) were analyzed, which showed that this was associated with the selected variable of ET. The smaller areas of modeled RSM in January and February were caused by the ET maps that were still incomplete at that time. Taking January as an example, the areas of ET and modeled RSM were 43.236 × 10 4 km 2 and 42.695 × 10 4 km 2 , respectively ( Figure 8). The region with modeled RSM, as shown in Figure 8c, also had ET value, and the area of 0.541 × 10 4 km 2 with ET but without an RSM value should be associated with other unavailable variables (e.g., DL) at that time. Pixels with a DL below 0 • C were removed to avoid freezing temperatures in winter [104]. Therefore, a limitation of the SMLR model, to some extent, might be the fact that the availability of independent variables directly affects the coverage of the modeled RSM. km 2 and 42.695 × 10 4 km 2 , respectively ( Figure 8). The region with modeled RSM, as shown in Figure 8c, also had ET value, and the area of 0.541 × 10 4 km 2 with ET but without an RSM value should be associated with other unavailable variables (e.g., DL) at that time. Pixels with a DL below 0 °C were removed to avoid freezing temperatures in winter [104]. Therefore, a limitation of the SMLR model, to some extent, might be the fact that the availability of independent variables directly affects the coverage of the modeled RSM. According to the outcomes of the validation, for six months (December, January, February, April, May, and November), the Adj. R 2 exceeded 0.800 in 2017. These results proved the effectiveness of the SMLR method throughout the year in the study area [67,69]. This suggests that the SMLR method is a promising approach to estimate RSM. To retrieve RSM in each area and period using the SMLR method, the selected variables and coefficients of these variables only need to be updated. From a practical point of view, this is a significant finding, as it supports the use of multisource data to complement and/or replace coarse resolution satellite imagery in the simulation of RSM.
However, the methods used in the present study induce uncertainties. Only 17 features were used because of data limitations and hydrological parameters (e.g., runoff and irrigation activities), and other soil properties (e.g., soil porosity, bulk density, and soil organic matter) were ignored since these were difficult to obtain. There is no guarantee that the modeling outcomes will become better if more variables are collected. Certain variables could be removed when a new variable is added because of the issue of multicollinearity and the significance level. For regions where the availability of input variables is restricted, the simplest model, which includes soil texture and organic carbon, might be an alternative for estimating the water availability in the soil [58]. In addition, because the performance of the SMLR model was sensitive to the quality of the reference data as well as input data, the study was limited by the accuracy of the used reference RSM. Moreover, the good quality of each pixel for the inputs would improve the accuracy of the modeled RSM. The MODIS inputs (except for the land cover) were composite data over each 8-day period and could be affected by clouds or atmospheric interference. Besides ensuring each pixel value within the valid range, quality control According to the outcomes of the validation, for six months (December, January, February, April, May, and November), the Adj. R 2 exceeded 0.800 in 2017. These results proved the effectiveness of the SMLR method throughout the year in the study area [67,69]. This suggests that the SMLR method is a promising approach to estimate RSM. To retrieve RSM in each area and period using the SMLR method, the selected variables and coefficients of these variables only need to be updated. From a practical point of view, this is a significant finding, as it supports the use of multisource data to complement and/or replace coarse resolution satellite imagery in the simulation of RSM.
However, the methods used in the present study induce uncertainties. Only 17 features were used because of data limitations and hydrological parameters (e.g., runoff and irrigation activities), and other soil properties (e.g., soil porosity, bulk density, and soil organic matter) were ignored since these were difficult to obtain. There is no guarantee that the modeling outcomes will become better if more variables are collected. Certain variables could be removed when a new variable is added because of the issue of multicollinearity and the significance level. For regions where the availability of input variables is restricted, the simplest model, which includes soil texture and organic carbon, might be an alternative for estimating the water availability in the soil [58]. In addition, because the performance of the SMLR model was sensitive to the quality of the reference data as well as input data, the study was limited by the accuracy of the used reference RSM. Moreover, the good quality of each pixel for the inputs would improve the accuracy of the modeled RSM. The MODIS inputs (except for the land cover) were composite data over each 8-day period and could be affected by clouds or atmospheric interference. Besides ensuring each pixel value within the valid range, quality control procedures should be performed for each dataset using the quality flags in future studies [121,122].

Conclusions
Based on reference RSM (relative soil moisture (obtained in the previous study)), 34 candidate variables (12 quantitative variables and 22 dummy variables) from multisource data, including Moderate Resolution Imaging Spectroradiometer (MODIS) and topographic data, soil properties data, and meteorological data, were processed and modeled via stepwise multilinear regression (SMLR). After the accuracy assessment, monthly, seasonal, and annual spatial patterns of the modeled RSM were mapped and evaluated over the Chinese Loess Plateau (CLP) in 2017. The key findings and main conclusions are summarized in the following: • SMLR could model RSM with the desired accuracy (r = 0.969, RMSE = 0.761%, MAE = 0.576% in December) at a 500 m resolution over the CLP. Moreover, the use of multisource data to complement and/or replace coarse-resolution satellite data in RSM modeling could be considered.

•
The variables of elevation (0-500 m and 2000-2500 m), precipitation, soil texture of the loam, and nighttime land surface temperature could continuously be used in the regression models for all seasons in 2017.

•
Including dummy variables improved the model fit, both in calibration and validation, because the Adj. R 2 was higher with dummy variables than without.

•
The SMLR-modeled RSM for almost all periods except for January and February (because of unavailable ET data at that time) achieved better spatial coverage than that of the reference RSM. Thus, the availability of selected variables directly affects the coverage of the modeled RSM.
The SMLR-modeled RSM successfully characterized the spatiotemporal variability of RSM and agreed well with the reference RSM. The modeled RSM maps generated in this study are feasible for further study and can be regarded as in situ SM data. Further studies and validation of the presented SMLR models are advisable, which can be achieved by extending the investigation to other datasets, test areas, and data periods.        Figure A3. Selected variables for modeling at the monthly scale. Labels above and below 0 represent positive and negative directions of regression coefficients for selected variables in the regression models, respectively. Figure A4. Model validation (modeled RSM by SMLR compared with reference RSM) on the subset selected at the monthly time scale. Scores (Pearson's correlation coefficient (r), Adjusted R 2 (Adj. R 2 ), standard deviation (STD), root mean square error (RMSE), and mean absolute error (MAE)) were computed using data included in the corresponding subplot boundary. N represents the number of available RSM samples for each month. The associated p-values (in the subplots) with the correlation coefficients are all < 0.001. The colors of modeled RSM and linear fit are pink, light green, dark green, and orange representing winter, spring, summer, and autumn, respectively.   Appendix B Table A1. Information of the reference relative soil moisture (RSM) regarding the accuracy assessment against in situ RSM observation (station-based) as well as the mean RSM and the area of the reference RSM for each period.  Table A3. Mathematical expressions of the goodness of validation [11].