1. Introduction
Soil moisture, a key variable of regional and global climate systems, is important for understanding the interaction between the land and the atmosphere. Changes in soil moisture have a considerable impact on climate change [1]; hydrological processes, including precipitation, stream flow, and energy fluxes [2,3,4,5,6,7]; agricultural processes such as irrigation management and crop yield prediction [8,9]; and severe weather events such as droughts and heat waves [10,11,12,13,14,15,16]. Therefore, it is important to monitor temporal and spatial patterns of soil moisture.
Soil moisture information is provided by ground measurements at stations, remote sensing observations, and numerical models. In situ measurements provide accurate soil moisture data for specific locations with high temporal resolution (e.g., 30 min or 1 h). Global in situ soil moisture data can be acquired from the International Soil Moisture Network (ISMN; http://www.ipf.tuwien.ac.at/insitu) [17]. However, in situ measurements are expensive, and they do not provide information on the spatial distribution of soil moisture over vast remote areas. Satellite remote sensing-based approaches provide spatiotemporally continuous soil moisture data. Many satellite sensors such as the Advanced Microwave Scanning Radiometer 2 (AMSR2) [18], the Soil Moisture and Ocean Salinity sensor (SMOS) [19,20], the Advanced SCATterometer (ASCAT) [21], and Soil Moisture Active Passive (SMAP) [22] provide real-time global soil moisture from microwave observations with daily temporal resolution. However, remote sensing-based soil moisture has relatively coarse spatial resolution (10–40 km). In addition, the quality of satellite-derived soil moisture data depends on sensor characteristics and regional environmental factors (e.g., land cover, topography, and climate conditions). Spatiotemporally continuous global soil moisture data is also available from numerical models and reanalysis such as the Global Land Data Assimilation System (GLDAS) [23] and the Modern-Era Retrospective Analysis for Research and Applications (MERRA) [24]. In particular, reanalysis data provides more reliable soil moisture information than satellite-based soil moisture products [25] and offers historical soil moisture data (e.g., from 1979) and various soil moisture products (e.g., 3 h, daily, and root zone soil moisture). However, reanalysis data has several critical limitations: it cannot provide real-time soil moisture information, and it has very coarse spatial resolution (i.e., 0.25–1.0 degrees). For local and regional applications of soil moisture data in agriculture and water resources, such coarse resolution data is not particularly useful since it does not capture local variations in soil moisture [26,27]. Both microwave satellite sensor-derived soil moisture and reanalysis data share the problem of low spatial resolution; thus, research efforts have been made to improve the spatial resolution of soil moisture data [28,29,30,31,32,33].
To improve the spatial resolution of soil moisture data, various downscaling approaches have been developed using satellite-derived products and numerical model-derived output. Although the SMAP radar sensor has failed to provide data, SMAP was originally planned to produce 9 km resolution soil moisture data by integrating active and passive microwave measurements at the L-band [22]. AMSR2 provides soil moisture products at 10 km resolution, spatially enhanced from the C-band brightness temperature data by applying the smoothing filter-based intensity modulation (SFIM) downscaling technique with the high resolution Ka-band measurements [34]. Other downscaling approaches are based on the disaggregation of passive microwave soil moisture using high resolution optical/thermal sensor data [32,35,36,37,38]. Optical/thermal data has been used to downscale soil moisture since the concept of the ‘universal triangle’ was introduced [39,40]. This concept describes the relationship between soil moisture, surface temperature, and vegetation indices [27]. Many studies have downscaled soil moisture data using empirical regression models [37,38,41,42,43,44]. Merlin et al. [29,45] downscaled SMOS soil moisture to 1 km and 250 m resolution through a semi-empirical model, the DISaggregation based on Physical And Theoretical scale Change (DISPATCH) algorithm, which estimates soil moisture using Soil Evaporative Efficiency (SEE). However, these approaches have some limitations: a simple regression model cannot capture the complex behavior of soil moisture, and the DISPATCH algorithm works well only when there is a large spatial variability of temperature [29].
Recently, machine learning approaches have been applied in various remote sensing fields, including land cover classification [46,47,48], drought monitoring [49,50], atmospheric process modelling [49,51], polar sea ice characterization [52,53], rainfall rate retrievals [54], and biophysical parameter estimation [55,56]. Ahmad et al. [57] estimated soil moisture from the Variable Infiltration Capacity Three Layer (VIC) model, radar backscattering, and incidence angle measurements from Tropical Rainfall Measuring Mission (TRMM) and Normalized Difference Vegetation Index (NDVI) from Advanced Very High Resolution Radiometer (AVHRR) using two machine learning approaches: Support Vector Machine (SVM) and Artificial Neural Network (ANN). Srivastava et al. [58] conducted SMOS soil moisture downscaling using the MODerate Resolution Imaging Spectroradiometer (MODIS) Land Surface Temperature (LST) through SVM, Relevance Vector Machine, ANN, and Generalized Linear Model (GLM). Im et al. [32] downscaled AMSR-E soil moisture using MODIS LST, NDVI, Enhanced Vegetation Index (EVI), Leaf Area Index (LAI), Evapotranspiration (ET), and albedo through rule-based machine learning approaches, including random forest, Cubist, and boosted regression trees.
Most of the studies mentioned above downscaled soil moisture derived from a single sensor such as SMOS or AMSR-E. However, each sensor has different specifications, and the derived soil moisture heavily depends on the characteristics of the site under investigation [59]. No single satellite-derived soil moisture product is the most accurate all over the globe. GLDAS soil moisture is regarded as the reference soil moisture for many applications in the literature [60,61,62]. Since GLDAS estimates soil moisture using several land surface models through data assimilation of in situ and satellite observations and model-derived data [63], GLDAS soil moisture has been used to validate satellite-derived soil moisture at various spatial scales as well as in situ soil moisture measurements [60,61,62,63,64]. Thus, this study considers GLDAS soil moisture as a reference dataset and downscales it throughout East Asia through multi-sensor data fusion from an operational perspective.
In this work, we used machine learning to downscale GLDAS soil moisture by integrating satellite-derived soil moisture products (ASCAT and AMSR2) with high resolution (1 km) optical/thermal sensor data, including LST, NDVI, and land cover, and a digital elevation model (DEM). The objectives of this study are to (1) develop a soil moisture downscaling model by optimizing a modified regression tree; (2) produce high quality soil moisture products with 1 km spatial resolution throughout East Asia by integrating microwave soil moisture and auxiliary optical/thermal sensor products; and (3) compare and evaluate downscaled soil moisture using GLDAS soil moisture and in situ soil moisture measurements at 14 ground stations to examine its appropriateness as a real-time high resolution soil moisture product.
3. Methodology
Six input variables (ASCAT soil moisture, AMSR2 soil moisture, MODIS LST, NDVI, and Land Cover, and SRTM DEM) were used to simulate GLDAS soil moisture in order to develop a machine learning-based soil moisture downscaling algorithm. Although Tropical Rainfall Measuring Mission (TRMM) precipitation was originally considered as an input variable in this study, it was excluded because preliminary results showed no improvement in performance (not shown). In addition, TRMM data was not available for the northern part (>50 degrees) of the study region. The high uncertainty of TRMM precipitation over high latitudes (>40 degrees) may be the reason for its poor contribution to the soil moisture downscaling.
Figure 2 shows the process flow diagram proposed in the study. First, MODIS products and SRTM DEM were aggregated to the same grid size as GLDAS soil moisture (25 km) using a mean function. Six inputs at 25 km grid size from 2013 to 2015 (i.e., daily except for DEM and Land Cover) were extracted at 602 point locations that were selected after considering soil type, land cover, and DEM distribution throughout East Asia. The spatial distribution of the selected points and their characteristics in terms of the three considerations (i.e., soil type, land cover, and DEM) are summarized in Appendix A Figure A1 and Table A2. Although AMSR2, ASCAT, and GLDAS provide daily products, the MODIS LST and NDVI were provided at eight-day and 16-day intervals, respectively. The same values of MODIS LST and NDVI were used for all daily products falling within each interval. A total of 36,412 samples for clear sky days were used to develop the downscaling algorithm. The samples from 2013 to 2014 were used as training data (n = 20,787), and validation was conducted using the samples in 2015 (n = 15,625). This hindcast validation approach is commonly used in the operational applications of satellite remote sensing, especially for meteorological applications [51,78,79,80,81]. Six independent variables (ASCAT, AMSR2, MODIS, and SRTM products) and the dependent variable of GLDAS soil moisture were fed into machine learning (dotted lines in Figure 2).
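The aggregation and the year-based split described above can be illustrated with a minimal Python sketch. It is not the processing code used in this study: the function names (block_mean, hindcast_split), the use of None for missing cells, and the rule that a block with no valid cell yields None are all assumptions made for illustration.

```python
from statistics import fmean


def block_mean(grid, factor):
    """Aggregate a 2-D grid (list of rows) by averaging factor x factor blocks.

    Assumption: None marks a missing cell (e.g., cloud-covered); a block
    with no valid cell yields None. Incomplete edge blocks are dropped.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for r0 in range(0, n_rows - factor + 1, factor):
        row = []
        for c0 in range(0, n_cols - factor + 1, factor):
            cells = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
                if grid[r][c] is not None
            ]
            row.append(fmean(cells) if cells else None)
        out.append(row)
    return out


def hindcast_split(samples, train_years=(2013, 2014), test_years=(2015,)):
    """Split samples (dicts with a hypothetical 'year' key) by year."""
    train = [s for s in samples if s["year"] in train_years]
    test = [s for s in samples if s["year"] in test_years]
    return train, test


# Toy 2 x 4 grid aggregated with factor 2; a 1 km to 25 km aggregation
# would use factor=25 on the full-resolution grid.
fine = [[1.0, 3.0, 10.0, None],
        [5.0, 7.0, None, None]]
coarse = block_mean(fine, 2)
```

Splitting by year rather than at random keeps the 2015 samples entirely unseen during training, which is what makes the evaluation a hindcast.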
We adopted a modified regression tree from Cubist after considering the performance and operational use of the approach based on our previous study [32]. Although random forest proved to be very robust in many remote sensing applications [82,83,84,85] and produced slightly better performance in Im et al. [32], it requires a much longer processing time than a modified regression tree (i.e., Cubist), which makes it less appropriate for operational use. Cubist regression trees developed by RuleQuest Research have been widely used in the remote sensing field [32,49,86,87,88]. Cubist regression trees consider the nonlinear relationships between independent and dependent variables for modeling, and both continuous and discrete variables are allowed as input [89]. Tree output from the Cubist approach consists of rules and a multivariate regression associated with each rule to estimate the dependent variable, which is straightforward and interpretable. Thus, it overcomes the limitations of simple linear models [90]. Relative variable importance in Cubist models can be identified based on the percentage of variable usage in rules and regression models. Up to 500 rules can be generated in Cubist models, and the number of rules is controllable through a pruning approach that limits the maximum number of rules. Cubist regression trees generate an optimum number of rules within the maximum specified by the user. In this study, the number of rules was optimized based on the pruning approach using Root Mean Square Error (RMSE) and correlation coefficients (r). Finally, an optimized regression tree to estimate GLDAS soil moisture was determined. It is relatively easy to understand the physical meanings of the resultant rules, and this approach has a shorter operation time than other machine learning approaches such as random forest.
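To make the structure of such a model concrete, the following Python sketch shows how a set of rules with attached linear regressions can produce an estimate, and how a rule limit might be chosen from validation scores. It is an illustration only and does not call the Cubist software: the two rules, their thresholds and coefficients, the variable names, and the helper functions are all hypothetical, and the selection criterion (lowest RMSE, ties broken by higher r) is an assumption rather than the exact procedure of this study.

```python
def rule_applies(conditions, x):
    """conditions: list of (variable, op, threshold), op in {'<=', '>'}."""
    for var, op, thr in conditions:
        if op == "<=" and not x[var] <= thr:
            return False
        if op == ">" and not x[var] > thr:
            return False
    return True


def linear_model(intercept, coefs, x):
    """Multivariate linear regression attached to one rule."""
    return intercept + sum(w * x[var] for var, w in coefs.items())


def predict(rules, x, default):
    """Average the regressions of all rules covering x (default if none)."""
    preds = [
        linear_model(rule["intercept"], rule["coefs"], x)
        for rule in rules
        if rule_applies(rule["if"], x)
    ]
    return sum(preds) / len(preds) if preds else default


def pick_max_rules(scores):
    """scores: {max_rules: (rmse, r)}; assumed criterion described above."""
    return min(scores, key=lambda k: (scores[k][0], -scores[k][1]))


# Hypothetical rules; values are invented for illustration only.
RULES = [
    {"if": [("ndvi", "<=", 0.3)],
     "intercept": 0.05, "coefs": {"ascat": 0.5, "amsr2": 0.2}},
    {"if": [("ndvi", ">", 0.3), ("lst", "<=", 300.0)],
     "intercept": 0.10, "coefs": {"ascat": 0.4, "lst": -0.0002}},
]

sample = {"ndvi": 0.2, "ascat": 0.2, "amsr2": 0.1, "lst": 290.0}
estimate = predict(RULES, sample, default=0.2)
```

Because each rule is a readable condition paired with a linear equation, the variables that appear in conditions and regressions can be counted directly, which is the basis of the usage-based variable importance mentioned above.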
Since the spatial resolution of the AMSR2 and ASCAT soil moisture products is 25 km, they were resampled to a 1 km grid using the triangle-based linear interpolation in MATLAB, which is commonly used for the resampling of gridded data. Because our study area (East Asia) is wide and heterogeneous in terms of topography, land cover, and climate conditions, we expected that the performance of the soil moisture downscaling model could be improved by incorporating AMSR2 and ASCAT soil moisture data, which provide basic information on soil moisture in spite of their coarse resolution. To evaluate the model performance, r, RMSE, root-mean-squared difference (RMSD), relative RMSE (rRMSE), and relative RMSD (rRMSD) were used. Downscaled 1 km soil moisture data was quantitatively compared with the in situ soil moisture data.
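The evaluation metrics can be sketched in Python as follows. This is a minimal illustration, not the evaluation code of this study. In particular, the text does not define the relative metrics, so normalizing by the mean of the reference values and expressing the result in percent is an assumption, and the function names are hypothetical. RMSD would be computed with the same formula as RMSE in this sketch.

```python
import math


def rmse(est, ref):
    """Root mean square error (same formula serves as RMSD here)."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(est, ref)) / len(ref))


def pearson_r(est, ref):
    """Pearson correlation coefficient between estimates and reference."""
    n = len(ref)
    mean_e, mean_o = sum(est) / n, sum(ref) / n
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(est, ref))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_o = sum((o - mean_o) ** 2 for o in ref)
    return cov / math.sqrt(var_e * var_o)


def relative_rmse(est, ref):
    """RMSE as a percentage of the reference mean (assumed definition)."""
    return 100.0 * rmse(est, ref) / (sum(ref) / len(ref))


# Toy values for illustration only (not data from this study).
downscaled = [0.21, 0.25, 0.30, 0.18]
in_situ = [0.20, 0.27, 0.33, 0.15]
scores = (pearson_r(downscaled, in_situ),
          rmse(downscaled, in_situ),
          relative_rmse(downscaled, in_situ))
```

A relative metric is useful when comparing stations with different typical moisture levels, since the same absolute error weighs more heavily at a dry site than at a wet one.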