Downscaling GLDAS Soil Moisture Data in East Asia through Fusion of Multi-Sensors by Optimizing Modified Regression Trees

Soil moisture is a key part of Earth’s climate systems, including agricultural and hydrological cycles. Soil moisture data from satellite and numerical models is typically provided at a global scale with coarse spatial resolution, which is not enough for local and regional applications. In this study, a soil moisture downscaling model was developed using satellite-derived variables targeting Global Land Data Assimilation System (GLDAS) soil moisture as a reference dataset in East Asia based on the optimization of a modified regression tree. A total of six variables, Advanced Microwave Scanning Radiometer 2 (AMSR2) and Advanced SCATterometer (ASCAT) soil moisture products, Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM), and MODerate resolution Imaging Spectroradiometer (MODIS) products, including Land Surface Temperature, Normalized Difference Vegetation Index, and land cover, were used as input variables. The optimization was conducted through a pruning approach for operational use, and finally 59 rules were extracted based on root mean square errors (RMSEs) and correlation coefficients (r). The developed downscaling model showed a good modeling performance (r = 0.79, RMSE = 0.056 m3·m−3, and slope = 0.74). The 1 km downscaled soil moisture showed similar time series patterns with both GLDAS and ground soil moisture and good correlation with ground soil moisture (average r = 0.47, average RMSD = 0.038 m3·m−3) at 14 ground stations. The spatial distribution of 1 km downscaled soil moisture reflected seasonal and regional characteristics well, although the model did not result in good performance over a few areas such as Southern China due to very high cloud cover rates. The results of this study are expected to be helpful in operational use to monitor soil moisture throughout East Asia since the downscaling model produces daily high resolution (1 km) real time soil moisture with a low computational demand. 
This study yielded a promising result for operationally producing daily high resolution soil moisture data from multiple satellite sources, although several limitations remain. In future research, more variables including Global Precipitation Measurement (GPM) precipitation, Soil Moisture Active Passive (SMAP) soil moisture, and other vegetation indices will be integrated to improve the performance of the proposed soil moisture downscaling model.


Introduction
Soil moisture, a key variable of regional and global climate systems, is important for understanding the interaction between the land and the atmosphere. Changes in soil moisture have a considerable impact on climate change [1]; hydrological processes, including precipitation, stream flow, and energy fluxes [2-7]; agricultural processes such as irrigation management and crop yield prediction [8,9]; and severe weather events such as droughts and heat waves [10-16]. Therefore, it is important to monitor temporal and spatial patterns of soil moisture.
Soil moisture information has been provided by ground measurements at stations, remote sensing observations, and numerical models. In situ measurements provide accurate soil moisture data for specific locations with high temporal resolution (e.g., 30 min or 1 h). Global in situ soil moisture data can be acquired from the International Soil Moisture Network (ISMN; http://www.ipf.tuwien.ac.at/insitu) [17]. However, in situ measurements are expensive and do not provide information on the spatial distribution of soil moisture over vast remote areas. Satellite remote sensing-based approaches provide spatiotemporally continuous soil moisture data. Many satellite sensors such as Advanced Microwave Scanning Radiometer 2 (AMSR2) [18], Soil Moisture and Ocean Salinity sensor (SMOS) [19,20], the Advanced SCATterometer (ASCAT) [21], and Soil Moisture Active Passive (SMAP) [22] provide real time global soil moisture from passive or active microwave measurements with daily temporal resolution. However, remote sensing-based soil moisture has relatively coarse spatial resolution (10-40 km). In addition, the quality of satellite-derived soil moisture data depends on sensor characteristics and regional environmental factors (e.g., land cover, topography, and climate conditions). Spatiotemporally continuous global soil moisture data is also available from numerical models and reanalysis such as Global Land Data Assimilation System (GLDAS) [23] and Modern-Era Retrospective Analysis for Research and Applications (MERRA) [24]. In particular, reanalysis data provides more reliable soil moisture information than satellite-based soil moisture products [25] and produces historical soil moisture data (e.g., from 1979) and various soil moisture products (e.g., 3 h, daily, and root zone soil moisture). However, there are several critical limitations, including that it is not possible to produce real-time soil moisture information from reanalysis data. In addition, it has very coarse spatial resolution (i.e., 0.25-1.0 degrees). For local and regional applications of soil moisture data on agriculture and water resources, such coarse resolution data is not particularly useful since it does not provide details on local variations in soil moisture [26,27]. Both microwave satellite sensor-derived soil moisture and reanalysis data have a common problem in that they have low spatial resolution; thus research efforts have been made to improve the spatial resolution of soil moisture data [28-33].
To improve the spatial resolution of soil moisture data, various downscaling approaches have been developed using satellite-derived products and numerical model-derived output. Although the SMAP radar sensor has failed to provide data, the mission was originally planned to produce 9 km resolution soil moisture data by integrating active and passive microwave measurements at the L-band [22]. AMSR2 provides soil moisture products at 10 km resolution spatially enhanced from the C-band brightness temperature data by applying the smoothing filter-based intensity modulation (SFIM) downscaling technique using the high resolution Ka-band measurements [34]. Other downscaling approaches are based on the disaggregation of passive microwave soil moisture using high resolution optical/thermal sensor data [32,35-38]. Optical/thermal data has been used to downscale soil moisture since the concept of the 'universal triangle' was introduced [39,40]. This concept explains the relationship between soil moisture, surface temperature, and vegetation indices [27]. Many studies have conducted downscaling of soil moisture data using empirical regression models [37,38,41-44]. Merlin et al. [29,45] downscaled SMOS soil moisture to 1 km and 250 m resolution through a semi-empirical model, the DISaggregation based on Physical And Theoretical scale Change (DISPATCH) algorithm, which estimates soil moisture using Soil Evaporative Efficiency (SEE). However, these approaches have some limitations: a simple regression model is not able to capture the complex behavior of soil moisture, and the DISPATCH algorithm works well only when there is a large spatial variability of temperature [29].
Most of the studies mentioned above downscaled single sensor-derived soil moisture such as SMOS and AMSR-E. However, each sensor has different specifications, and the derived soil moisture heavily depends on the site characteristics under investigation [59]. There is no single satellite-derived soil moisture product that is the most accurate all over the globe. GLDAS soil moisture is regarded as the reference soil moisture for many applications in the literature [60-62]. Since GLDAS estimates soil moisture using several land surface models through data assimilation of in situ and satellite observations and model-derived data [63], GLDAS soil moisture has been used to validate satellite-derived soil moisture at various spatial scales as well as in situ soil moisture measurements [60-64]. Thus, this study considers GLDAS soil moisture as a reference dataset and downscales it throughout East Asia through multi-sensor data fusion from an operational perspective.
In this work, we downscaled GLDAS soil moisture by integrating satellite-derived soil moisture products (ASCAT and AMSR2) and high resolution (1 km) optical/thermal sensor data, including LST, NDVI, land cover, and digital elevation models (DEM) based on machine learning. The objectives of this study are to (1) develop a soil moisture downscaling model by optimizing a modified regression tree; (2) produce high quality soil moisture products throughout East Asia by integrating microwave soil moisture and auxiliary optical/thermal sensor products with 1 km spatial resolution; and (3) compare and evaluate downscaled soil moisture using GLDAS soil moisture and in situ soil moisture measurements at 14 ground stations to examine its appropriateness as a real-time high resolution soil moisture product.

Study Area
The study area is East Asia (latitude: 10.17   E), including east China, southeast Russia, Taiwan, Korea, and Japan (Figure 1). East Asia frequently suffers from floods (typically from June to August) and droughts (typically from March to May) due to the climatic characteristics of the region such as monsoons. East Asia generally has hot and humid weather conditions in summer, while it is dry and cold in winter. Climatic characteristics such as mean temperature and precipitation differ slightly from country to country (in particular by latitude). The annual mean temperature is about 15 °C in east China, 12 °C on the Korean peninsula, and 16 °C in Japan. Figure 1 shows the land cover distribution of the study area. East China consists of forest, cropland, savannas, grassland, and barren areas, and the Korean peninsula and Japan are mostly composed of forest and cropland. There are 15 soil types in the study area. While most of these different soil types are found in east China, only two (i.e., leptosols and acrisols) exist in Korea and three (i.e., leptosols, acrisols, and andosols) in Japan.
Figure 1. Land cover distribution of the study area. Refer to Table 2 for information on ground stations (a through n).

Soil Moisture
The AMSR2 instrument on the Global Change Observing Mission-Water (GCOM-W) satellite, launched in 2012, extends the legacy of AMSR-E. Compared to AMSR-E, AMSR2 has improved characteristics such as mitigation of radio frequency interference (RFI) using an additional channel (C-band frequency), higher reliability, and an improved calibration system [64-67]. The AMSR2 C-band-derived daily soil moisture product provided by the Japan Aerospace Exploration Agency (JAXA) at 0.1 and 0.25 degrees spatial resolution, expressed as volumetric water content in percent (0 to 40%), was used in this study. The product is retrieved by calculating the Polarization Index (PI) and the Index of Soil Wetness (ISW) from 10 and 36 GHz brightness temperatures, defined by Equations (1) and (2), based on a look-up-table approach [18].
where Tb_V and Tb_H indicate the brightness temperatures of the vertical and horizontal polarizations, and i and j are the high and low frequencies, respectively. Global AMSR2 soil moisture data (2013 to 2015) were obtained from the GCOM-W1 Data Providing Service (https://gcom-w1.jaxa.jp/auth.html). Daily data was calculated by averaging soil moisture in ascending and descending modes. When there were missing pixels, soil moisture data collected the day before was used to solve the no-data problem. In this way, almost all missing pixels (>99%) were filled for AMSR2 soil moisture.
ASCAT on the Meteorological operational satellite-A (MetOp-A) satellite is a real aperture radar sensor measuring radar backscatter using the C-band for monitoring wind over the oceans, soil moisture, and vegetation [68]. ASCAT soil moisture data sensed by C-band (5.255 GHz) microwaves from 2013 to 2015 were obtained from the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT; http://www.eumetsat.int). The soil moisture was calculated from backscatter data using Equation (3) [69,70].
where m_s is surface soil moisture, σ0 is the current backscattering value, and σ_wet and σ_dry are the backscattering values under wet and dry conditions, respectively. ASCAT soil moisture data is provided at spatial resolutions of 25 km and 12.5 km as percent values from 0 to 100 (0% means dry, 100% means wet). In this study, daily data was produced by averaging swath data (in both ascending and descending modes). Similar to AMSR2, ASCAT soil moisture data collected up to two days before was used to fill no-data pixels, if any. ASCAT soil moisture has many more missing values than AMSR2, which require up to three days of soil moisture to fill the gaps. Although ASCAT and AMSR2 provide coarse resolution data (25 km), they were used to produce downscaled soil moisture. It is expected that the use of ASCAT and AMSR2 soil moisture may improve the accuracy of the proposed downscaling models as the daily products already contain the regional characteristics of soil moisture.
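The scaling of backscatter between dry and wet references and the gap-filling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear form m_s = (σ0 − σ_dry) / (σ_wet − σ_dry) is the commonly used change detection relationship and is assumed here rather than taken from this text, and the function names, the `None` no-data convention, the fallback to a single overpass, and the sample values are all hypothetical.

```python
from typing import List, Optional, Sequence


def relative_soil_moisture(sigma0: float, sigma_dry: float, sigma_wet: float) -> float:
    """Scale backscatter linearly between dry and wet references, in percent.

    Assumed form: m_s = (sigma0 - sigma_dry) / (sigma_wet - sigma_dry) * 100.
    """
    if sigma_wet <= sigma_dry:
        raise ValueError("sigma_wet must be larger than sigma_dry")
    m_s = (sigma0 - sigma_dry) / (sigma_wet - sigma_dry) * 100.0
    return min(100.0, max(0.0, m_s))  # clip to the 0-100% product range


def daily_mean(ascending: Optional[float], descending: Optional[float]) -> Optional[float]:
    """Average both overpasses; fall back to the available one (assumption)."""
    values = [v for v in (ascending, descending) if v is not None]
    return sum(values) / len(values) if values else None


def fill_from_previous_days(series: Sequence[Optional[float]], max_lag: int) -> List[Optional[float]]:
    """Fill no-data days with the latest real observation up to max_lag days back.

    Only original observations are reused, so a filled value is never
    carried further than max_lag days.
    """
    filled: List[Optional[float]] = []
    for day, value in enumerate(series):
        if value is None:
            for lag in range(1, max_lag + 1):
                if day - lag >= 0 and series[day - lag] is not None:
                    value = series[day - lag]
                    break
        filled.append(value)
    return filled


# Hypothetical pixel time series (percent); None marks a missing retrieval.
ascat_pixel = [42.0, None, None, None, 55.0]
print(fill_from_previous_days(ascat_pixel, max_lag=2))
```

With `max_lag=2`, the fourth day in the hypothetical series stays empty because its nearest observation is three days old, which mirrors the limit on how far back the fill is allowed to reach.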

Other Input Parameters
MODIS is an instrument onboard the Terra and Aqua satellites, which has been widely used for various environmental monitoring applications on both regional and global scales. The eight-day LST (MOD11A2) [71], 16-day NDVI (MOD13A2) [72], and Land cover (MCD12Q1) [73] Terra ascending data (10:30 am) were used in this study (Table 1). While LST and NDVI are provided at 1 km resolution, the land cover product (MCD12Q1) has a spatial resolution of 500 m. The land cover data with seventeen classes was resampled to 1 km using a majority filter, and then similar classes were aggregated into nine classes. Since the accuracy of MODIS land cover is not high for all classes, especially for vegetation (e.g., Forest, Shrublands, and Savannas) in East Asia, we used representative land covers through the aggregation of similar classes (refer to the Appendix A Table A1). A total of 24 tiles (h23v03 to h30v07) covering the study area from 2013 to 2015 were obtained from the reverb echo site (http://reverb.echo.nasa/gov/reverb). The Shuttle Radar Topography Mission (SRTM) [74] was flown on the Space Shuttle mission Endeavour STS-99, which had C-band Spaceborne Imaging Radar and X-band Synthetic Aperture Radar (X-SAR) hardware. A near-global Digital Elevation Model (DEM) was obtained using the interferometric processing of single pass data [75]. SRTM DEM data is provided at 30 m and 90 m resolution from the United States Geological Survey (USGS) Elevation Products site (http://eros.usgs.gov/elevation-products). In this study, 90 m DEM data was used and resampled to 1 km using a mean function, matching the MODIS products.
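The two resampling operations mentioned above (a majority filter for the categorical land cover and a mean function for the continuous DEM) can be illustrated with a small block-reduction sketch. It is a simplified illustration, not the processing chain used in the study: it assumes the fine grid divides evenly into coarse cells (true for 500 m to 1 km with a factor of 2, only approximately true for 90 m to 1 km), and the helper names and sample grids are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Callable, List, Sequence


def block_reduce(grid: Sequence[Sequence[float]], factor: int,
                 reducer: Callable[[List[float]], float]) -> List[List[float]]:
    """Apply reducer to each non-overlapping factor x factor block of grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be a multiple of factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(reducer(block))
        out.append(out_row)
    return out


def majority(block: List[float]) -> float:
    """Most frequent class in the block; ties go to the class seen first."""
    return Counter(block).most_common(1)[0][0]


# Hypothetical 500 m land cover classes and fine-resolution elevations (m).
land_cover_500m = [[1, 1, 2, 2],
                   [1, 3, 2, 4],
                   [5, 5, 6, 6],
                   [5, 6, 6, 6]]
dem_fine = [[100.0, 110.0, 200.0, 210.0],
            [120.0, 130.0, 220.0, 230.0],
            [300.0, 310.0, 400.0, 410.0],
            [320.0, 330.0, 420.0, 430.0]]

land_cover_1km = block_reduce(land_cover_500m, 2, majority)  # categorical data
dem_coarse = block_reduce(dem_fine, 2, mean)                 # continuous data
print(land_cover_1km, dem_coarse)
```

The same `block_reduce` helper with a larger factor also covers the later aggregation of 1 km inputs to the GLDAS grid, again assuming the grids are aligned.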

GLDAS Soil Moisture
GLDAS has been developed to identify land surface states and fluxes using data assimilation techniques and consists of three land surface models: Mosaic, Noah, and the Community Land Model (CLM) [25,76]. GLDAS soil moisture data from 2013 to 2015 was obtained from the Goddard Earth Sciences Data and Information Services Center (http://disc.sci.gsfc.nasa.gov/hydrology/dataholdings). In this study, three-hourly GLDAS Noah Land Surface Model (LSM) data at the spatial resolution of 0.25 degrees was used because GLDAS Noah LSM provides the highest resolution data among GLDAS soil moisture products. Specifically, layer 1 (i.e., 1-10 cm) soil moisture data was used because satellite-derived soil moisture involves only top soil moisture (1-5 cm). Since daily soil moisture data is not provided, daily soil moisture was calculated by averaging three-hourly data.

Ground Soil Moisture
Ground soil moisture data at 14 stations in South Korea were obtained from the Korea Rural Development Administration (RDA; http://weather.rda.go.kr/). RDA provides hourly ground soil moisture data in percentage at 10 cm depth using Time Domain Reflectometry (TDR), which is based on the relation between dielectric properties of soils and moisture levels [77]. Table 2 shows information on the 14 stations such as location, elevation, and land cover. Most of the stations are located in cropland, and the soil types are mainly sandy loam, clay, and clay loam. In this study, daily soil moisture data was calculated by averaging hourly data at each station to evaluate downscaled GLDAS and satellite-derived soil moisture data.

Methodology
A total of six input variables, ASCAT soil moisture, AMSR2 soil moisture, MODIS LST, NDVI, and Land Cover, and SRTM DEM, were used for simulation of the GLDAS soil moisture to develop a machine learning-based soil moisture downscaling algorithm. Although Tropical Rainfall Measuring Mission (TRMM) precipitation was originally considered as an input variable in this study, it was excluded based on the preliminary results, which did not show any improvement in performance (not shown). In addition, TRMM data was not available for the northern part (>50 degrees) of the study region. The high uncertainty of TRMM precipitation over high latitudes (>40 degrees) may be the reason for its poor contribution to the soil moisture downscaling. Figure 2 shows the process flow diagram proposed in the study. First, MODIS products and SRTM DEM were aggregated to the same grid size as GLDAS soil moisture (25 km) using a mean function. Six inputs at 25 km grid size from 2013 to 2015 (i.e., daily except for DEM and Land Cover) were extracted based on 602 point locations that were selected after considering soil type, land cover, and DEM distribution throughout East Asia. The spatial distribution of the selected points and their characteristics in terms of the three considerations (i.e., soil type, land cover, and DEM) are summarized in Appendix A Figure A1 and Table A2. Although AMSR2, ASCAT, and GLDAS provide daily products, the MODIS LST and NDVI were provided with eight-day and 16-day intervals, respectively. The same values of MODIS LST and NDVI were used during the intervals corresponding to daily products. A total of 36,412 samples for clear sky days were used to develop the downscaling algorithm. The samples from 2013 to 2014 were used as training data (n = 20,787), and validation was conducted using the samples in 2015 (n = 15,625). This hindcast validation approach is commonly used in the operational applications of satellite remote sensing, especially for meteorological applications [51,78-81]. Six independent variables, ASCAT, AMSR2, MODIS, and SRTM products, and the dependent variable of GLDAS soil moisture were fed into machine learning (dotted lines in Figure 2).
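Two of the data preparation steps above, repeating composite values over their compositing interval and splitting samples by year for hindcast validation, can be sketched as follows. This is a minimal sketch under assumptions: the tuple layout of a sample, the helper names, and the values are hypothetical, and real MODIS composites restart at the beginning of each year, so the last period of a year is shorter than the nominal interval.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List, Set, Tuple

Sample = Tuple[date, float]  # (acquisition date, GLDAS soil moisture); hypothetical layout


def composite_to_daily(composites: Dict[date, float], period_days: int) -> Dict[date, float]:
    """Repeat each composite value for every day of its compositing period."""
    daily: Dict[date, float] = {}
    for start in sorted(composites):
        for offset in range(period_days):
            # A later composite overwrites any overlap left by an earlier one.
            daily[start + timedelta(days=offset)] = composites[start]
    return daily


def split_by_year(samples: Iterable[Sample], train_years: Set[int],
                  test_years: Set[int]) -> Tuple[List[Sample], List[Sample]]:
    """Hindcast split: earlier years for training, a later year for validation."""
    samples = list(samples)
    train = [s for s in samples if s[0].year in train_years]
    test = [s for s in samples if s[0].year in test_years]
    return train, test


# Hypothetical eight-day LST composites (K) and soil moisture samples (m3/m3).
lst_daily = composite_to_daily({date(2015, 1, 1): 280.0, date(2015, 1, 9): 282.5}, 8)
samples = [(date(2013, 6, 1), 0.20), (date(2014, 6, 1), 0.25), (date(2015, 6, 1), 0.30)]
train, test = split_by_year(samples, {2013, 2014}, {2015})
print(len(lst_daily), len(train), len(test))
```

Splitting by year rather than at random keeps the validation year entirely unseen during training, which is the point of the hindcast design.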
We adopted a modified regression tree from Cubist after considering the performance and operational use of the approach based on our previous study [32]. Although random forest proved to be very robust in many remote sensing applications [82-85] and produced slightly better performance in Im et al. [32], it requires a much longer processing time than a modified regression tree (i.e., Cubist), which makes it less appropriate for operational use. Cubist regression trees developed by RuleQuest Research have been widely used in the remote sensing field [32,49,86-88]. Cubist regression trees consider the nonlinear relationships between independent and dependent variables for modeling, and both continuous and discrete variables are allowed as input [89]. Tree output from the Cubist approach consists of rules and a multivariate regression associated with each rule to estimate the dependent variable, which is straightforward and interpretable. Thus, it overcomes the limitations of simple linear models [90]. Relative variable importance in Cubist models can be identified based on the percentage of variable usage in rules and regression models. Up to 500 rules can be generated in Cubist models, and the number of rules is controllable using a pruning approach by limiting the maximum number of rules. Cubist regression trees generate the optimum number of rules that is less than the maximum number of rules specified by the user. In this study, the number of rules was optimized based on the pruning approach using Root Mean Square Error (RMSE) and correlation coefficients (r). Finally, an optimized regression tree to estimate GLDAS soil moisture was determined. It is relatively easy to understand the physical meanings of the resultant rules, and this approach has a shorter operation time than other machine learning approaches such as random forest.
Since the spatial resolution of AMSR2 and ASCAT soil moisture products is 25 km, they were resampled to a 1 km grid size simply by using the triangle-based linear interpolation in MATLAB, commonly used for the resampling of gridded data. We expected that the performance of the soil moisture downscaling model could be improved by incorporating AMSR2 and ASCAT soil moisture data, which provide basic information on soil moisture in spite of their coarse resolution, because our study area (East Asia) is wide and heterogeneous in terms of topography, land cover, and climate conditions. In order to evaluate the model performance, r, RMSE, root-mean-squared difference (RMSD), relative RMSE (rRMSE), and relative RMSD (rRMSD) were used. Downscaled 1 km soil moisture data was quantitatively compared with the in situ soil moisture data.
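The accuracy metrics listed above can be written compactly as below. This is a sketch under assumptions rather than the evaluation code of the study: the relative metric is assumed to be the RMSE divided by the mean of the reference values (in percent), the slope is assumed to come from an ordinary least squares fit of the estimate against the reference, and RMSD is taken to share the RMSE formula, with only the name changing when neither dataset is treated as truth. All function names are hypothetical.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Root mean square error (same formula as RMSD)."""
    return math.sqrt(_mean([(e - r) ** 2 for r, e in zip(reference, estimate)]))


def relative_rmse(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """RMSE as a percentage of the mean reference value (assumed definition)."""
    return 100.0 * rmse(reference, estimate) / _mean(reference)


def slope(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Least squares slope of estimate regressed on reference."""
    mx, my = _mean(reference), _mean(estimate)
    num = sum((a - mx) * (b - my) for a, b in zip(reference, estimate))
    return num / sum((a - mx) ** 2 for a in reference)


# Hypothetical reference and estimated soil moisture values (m3/m3).
gldas = [0.10, 0.20, 0.30, 0.40]
predicted = [0.5 * v + 0.1 for v in gldas]
print(pearson_r(gldas, predicted), slope(gldas, predicted), rmse(gldas, predicted))
```

In this hypothetical example the correlation is perfect while the slope is 0.5, which illustrates why the slope is reported alongside r: a high correlation alone does not reveal a compressed dynamic range.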

Model Optimization
The maximum number of rules from the fully grown regression tree generated in this study was 329. The optimization of the number of rules was conducted using the validation dataset (n = 15,625) based on the accuracy metrics, RMSE and r. A smaller number of rules was produced through the pruning process. Figure 3 shows the change of the RMSE and r values with the decreasing number of rules. As expected, a larger number of rules produced a lower RMSE and a higher r. However, there was no significant difference in RMSE and r for the relatively large numbers of rules (≥59). For numbers of rules smaller than 59, RMSE dramatically increased with the decreasing number of rules. As the number of rules decreased, most of the rules became simplified and aggregated into smaller numbers of rules. In this study, we determined 59 to be the optimal number of rules. Each rule is associated with a multivariate regression model, which has been commonly used in the literature [37,38,41,42]. The modified regression tree (Figure 3; RMSE = 0.06 m3·m−3, r = 0.8) showed better modeling performance than the single multiple linear regression model (Figure 3; RMSE = 0.07 m3·m−3, r = 0.6).
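One way to make this kind of selection reproducible is to pick the smallest rule count whose validation RMSE stays within a tolerance of the best RMSE. The sketch below shows that idea only; it is an assumption about how the visual inspection of Figure 3 could be automated, not the procedure reported in the study, and the rule counts, scores, and tolerance are invented for illustration.

```python
from typing import Dict, Tuple


def select_rule_count(scores: Dict[int, Tuple[float, float]], rmse_tolerance: float) -> int:
    """Return the smallest rule count with RMSE within rmse_tolerance of the best.

    scores maps a maximum number of rules to (validation RMSE, validation r).
    """
    best_rmse = min(rmse for rmse, _ in scores.values())
    acceptable = [n for n, (rmse, _) in scores.items() if rmse - best_rmse <= rmse_tolerance]
    return min(acceptable)


# Hypothetical validation scores for models pruned to different rule counts.
validation_scores = {
    300: (0.0555, 0.795),
    150: (0.0557, 0.793),
    60: (0.0560, 0.790),
    30: (0.0610, 0.750),
    10: (0.0680, 0.680),
}
chosen = select_rule_count(validation_scores, rmse_tolerance=0.001)
print(chosen)
```

The tolerance encodes what counts as "no significant difference"; a stricter value keeps more rules, while a looser one favors a smaller and faster model.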
Water 2017, 9, 332
Table 3 summarizes the sub-models selected from the optimized regression tree results (i.e., 59 rule-based sub-models), which covered the majority of sample cases to downscale GLDAS soil moisture in this study. Each sub-model consists of a rule and its associated multivariate regression model. Since land cover is a discrete variable, land cover was used only for the rules. East Asia has various geophysical characteristics in terms of topography, land cover, and seasonal climate conditions. The downscaling model considered such geographical and seasonal characteristics in the rules in that the elevation (DEM), surface temperature (LST), land cover type, and vegetation healthiness (NDVI) were used to identify geographical and seasonal characteristics. Each rule provides specific conditions with thresholds so that the corresponding multivariate regression can be applied. For example, rule 1 estimated dry soil moisture (mean = 0.13 m3·m−3) in a barren area with high LST, while rule 40 was used to estimate wet soil moisture (mean = 0.29 m3·m−3) in an area of vegetation (forest, shrublands, savannas, and cropland) with low LST and high NDVI. Similarly, rule 2 (mean = 0.14 m3·m−3) and rule 3 (mean = 0.16 m3·m−3) have the same conditions for land cover and LST, but they have different conditions for ASCAT, DEM, and NDVI. Rule 3 was developed to estimate slightly wetter soil moisture than rule 2, so the condition of ASCAT soil moisture (ASCAT > 0.1713) in rule 3 is higher than in rule 2 (ASCAT ≤ 0.1713).
Figure 4 depicts the modeling results (i.e., both calibration and validation) of the optimized regression tree (59 rules) that compare the predicted soil moisture with the GLDAS soil moisture. Calibration and validation were conducted using the training dataset (2013-2014) and the test dataset (2015), respectively. The modeling performance was good in both calibration (r = 0.87, RMSE = 0.048 m3·m−3, and slope = 0.77) and validation (r = 0.79, RMSE = 0.056 m3·m−3, and slope = 0.74), although the validation results were slightly poorer than those in the calibration. The proposed downscaling model seems to underestimate GLDAS soil moisture.
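The structure of such a rule-based sub-model (a set of threshold conditions paired with a linear regression) can be illustrated with a small Python sketch. Only the ASCAT threshold of 0.1713 comes from the text; the land cover code, intercepts, and coefficients are hypothetical placeholders, and averaging the predictions of all matching rules is an assumption about how overlapping rules are combined rather than a statement about the model used in the study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]
BARREN = 1.0  # hypothetical land cover code


@dataclass
class Rule:
    condition: Callable[[Features], bool]  # thresholds that define where the rule applies
    intercept: float
    coefficients: Dict[str, float]         # linear model used inside the rule

    def predict(self, x: Features) -> float:
        return self.intercept + sum(c * x[name] for name, c in self.coefficients.items())


def predict(rules: List[Rule], x: Features) -> float:
    """Average the linear predictions of every rule whose conditions hold."""
    matched = [rule.predict(x) for rule in rules if rule.condition(x)]
    if not matched:
        raise ValueError("no rule covers this sample")
    return sum(matched) / len(matched)


# Two illustrative rules that differ only in the ASCAT condition.
rules = [
    Rule(lambda x: x["land_cover"] == BARREN and x["ASCAT"] <= 0.1713, 0.10, {"ASCAT": 0.2}),
    Rule(lambda x: x["land_cover"] == BARREN and x["ASCAT"] > 0.1713, 0.12, {"ASCAT": 0.2}),
]

drier = predict(rules, {"land_cover": BARREN, "ASCAT": 0.10})
wetter = predict(rules, {"land_cover": BARREN, "ASCAT": 0.25})
print(drier, wetter)
```

Because land cover appears only in the conditions and never in `coefficients`, the sketch also reflects the point that a discrete variable can select a rule but does not enter the regression itself.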
Table 4 shows the attribute usage information of the six variables in the rules and regression models. Land cover, DEM, and LST show high usage (~80-90%) in the rules because they are important variables for distinguishing regional and seasonal characteristics of soil moisture in East Asia. The five variables other than land cover were evenly used in the regression models, with usage ranging from 74 to 97%. It is not surprising that DEM, LST, land cover, and NDVI show high variable importance since such information is integrated when producing the GLDAS soil moisture. It is surprising, though, that the soil moisture products from ASCAT and AMSR2 were not used more frequently in the rules than the other variables to estimate GLDAS soil moisture. This implies that the ASCAT and AMSR2 soil moisture algorithms might not be able to effectively consider regional or seasonal characteristics in East Asia. When ASCAT and AMSR2 soil moisture data were compared to GLDAS soil moisture, ASCAT data tended to overestimate soil moisture, while AMSR2 data significantly underestimated soil moisture throughout the study area. This may explain the low usage of the ASCAT and AMSR2 data (especially AMSR2) to simulate GLDAS soil moisture in both the rules and regression models that resulted from the modified regression tree.
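One way to read the usage percentages in Table 4 is as a case-weighted share: the fraction of training cases covered by rules that mention a variable in their conditions or in their regression models. The sketch below follows that reading, which is an assumption about how the statistic is defined. The three toy rules, their case counts, and the `usage_percent` helper are hypothetical and do not reproduce the 59 rules of the study.

```python
from typing import Dict, List, Set


def usage_percent(rules: List[dict], key: str) -> Dict[str, float]:
    """Case-weighted share (%) of rules whose `key` set mentions each variable."""
    total = sum(r["cases"] for r in rules)
    variables: Set[str] = set().union(*(r[key] for r in rules))
    return {
        v: 100.0 * sum(r["cases"] for r in rules if v in r[key]) / total
        for v in sorted(variables)
    }


# Hypothetical rules: number of covered cases, variables used in the rule
# conditions, and variables used in the attached regression model.
toy_rules = [
    {"cases": 600, "conditions": {"land_cover", "lst"}, "model": {"ascat", "ndvi", "lst"}},
    {"cases": 300, "conditions": {"land_cover", "dem"}, "model": {"ascat", "dem"}},
    {"cases": 100, "conditions": {"dem", "ascat"}, "model": {"ndvi", "amsr2"}},
]

rule_usage = usage_percent(toy_rules, "conditions")
model_usage = usage_percent(toy_rules, "model")
```

Under this definition a variable can have high usage in the conditions and low usage in the regression models, or the reverse, which is why the two columns of Table 4 are reported separately.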

Model Evaluation
Figure 5 shows the time series of ground soil moisture, GLDAS soil moisture, and 1 km downscaled soil moisture with precipitation at 14 stations during the growing season (from May to September) in 2015. Since the 1 km downscaled soil moisture cannot be produced on cloudy days due to missing data for input variables, there are some gaps in Figure 5. The 1 km downscaled soil moisture, as well as GLDAS soil moisture, shows a temporal pattern similar to that of ground soil moisture. Soil moisture increased with increasing rainfall. Soil moisture is generally low in the dry season and high in the wet season. However, GLDAS and the 1 km downscaled soil moisture tend to be lower than ground soil moisture, as discussed by Zhang et al. [76], due to the difference between the depth of GLDAS and the 1 km downscaled soil moisture (~5 cm) and that of ground soil moisture (10 cm). It should also be noted that the spatial scales are quite different among the three types of soil moisture data: ground soil moisture was measured at point locations, while GLDAS and 1 km downscaled soil moisture data represent 25 km × 25 km and 1 km × 1 km grids, respectively. Thus, while ground soil moisture fluctuates highly, GLDAS soil moisture data rarely shows extreme values. As discussed in Choi et al. [37], although most RDA sites are located in cropland, the corresponding domains of AMSR-E (25 km) and MODIS (1 km) for each site consist of more heterogeneous land cover types (e.g., cropland, built-up, barren land, and forest) because the RDA sites were not initially designed to validate remote sensing soil moisture. On the other hand, many validation sites in previous studies, such as OZnet [91,92], were designed with remote sensing validation in mind and consist of homogeneous land cover within remote sensing pixels.
The 1 km downscaled soil moisture was compared to the ground soil moisture using scatterplots at 14 stations from May to September 2015 (Figure 6). Since ground soil moisture has a spatial scale different from the downscaled one, ground soil moisture within each grid (i.e., 1 km × 1 km) was assumed to be consistent [32]. The comparison between the ground and downscaled soil moisture data varied by station, resulting in mean slope ~0.987, mean RMSD ~0.041 m3·m−3, and mean r ~0.53 across all 14 stations. The 1 km downscaled soil moisture produced in this study also shows relatively low RMSD and high r compared to other soil moisture downscaling studies: Im et al. [32] (mean slope = 0.366, mean RMSD = 0.092 m3·m−3, and mean r = 0.51), Choi et al. [37] (mean slope = 0.769, mean RMSD = 0.124 m3·m−3, and mean r = 0.46), Merlin et al. [93] (mean slope = 0.523, mean RMSD = 0.078 m3·m−3, and mean r = 0.58), and Djamai et al. [94] (mean slope = 1.188, mean RMSD = 0.07 m3·m−3, and mean r = 0.5), although it is not possible to directly compare the accuracy metrics among the studies. Nonetheless, the results imply that the high-resolution soil moisture produced using the proposed downscaling approach is closely related to ground soil moisture.
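The three station-level statistics quoted above (slope, RMSD, and r) can be computed as in the following minimal sketch. It assumes that ground soil moisture is placed on the x-axis of the scatterplot when fitting the slope and that days with a missing value in either series (for example, cloudy days) are skipped; neither choice is stated in the text. The function name `station_metrics` and the sample values are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def station_metrics(
    ground: Sequence[Optional[float]],
    downscaled: Sequence[Optional[float]],
) -> Dict[str, float]:
    """Least-squares slope, RMSD, and Pearson r for one station."""
    # Keep only days where both series have a finite value.
    pairs = [
        (g, d)
        for g, d in zip(ground, downscaled)
        if g is not None and d is not None and math.isfinite(g) and math.isfinite(d)
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_g = sum(g for g, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    s_gg = sum((g - mean_g) ** 2 for g, _ in pairs)
    s_dd = sum((d - mean_d) ** 2 for _, d in pairs)
    s_gd = sum((g - mean_g) * (d - mean_d) for g, d in pairs)
    return {
        "slope": s_gd / s_gg,                    # downscaled regressed on ground
        "rmsd": math.sqrt(sum((d - g) ** 2 for g, d in pairs) / n),
        "r": s_gd / math.sqrt(s_gg * s_dd),
    }


# Hypothetical daily values (m3·m−3); None marks a day without data.
ground_obs = [0.10, 0.20, None, 0.30, 0.40]
downscaled_obs = [0.08, 0.18, 0.25, 0.28, None]
metrics = station_metrics(ground_obs, downscaled_obs)
```

In this toy case the downscaled series is a constant 0.02 m3·m−3 below the ground series, so the slope and r are both 1 while the RMSD is 0.02 m3·m−3, which shows why the three statistics are reported together.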
Figure 7 shows the spatial distribution of monthly downscaled and GLDAS soil moisture from May to September 2015. The spatial distributions of both soil moisture datasets by land cover are consistent with the literature [95,96]. Both soil moisture products show relatively high soil moisture levels in forest regions (e.g., southern China, the Korean Peninsula, and Japan), while presenting dry soil in desert and built-up regions (e.g., Shandong and the Gobi Desert). North China, including the Gobi Desert, has low soil moisture conditions regardless of the season. The 1 km downscaled soil moisture over some parts of southern China was not available due to clouds, and where it was available, it differed considerably from GLDAS soil moisture. The much smaller number of training samples over southern China (less than 8% of the total number of training samples) may explain such poor performance, which implies that the performance of the empirical model depends strongly on the number of training samples. The dynamic range of the predicted soil moisture was slightly smaller (~0.05 m3·m−3) than that of the GLDAS soil moisture because the Cubist model, like other empirical statistical and machine learning approaches, tends to produce results that reduce estimation errors, which compresses the dynamic range toward mean values [32,49]. While the 1 km downscaled soil moisture was underestimated in humid regions (e.g., Japan, North Korea, and Taiwan), it was overestimated in dry regions (e.g., the Gobi Desert) when compared to GLDAS soil moisture. However, the spatial pattern of the 1 km downscaled soil moisture matched well with that of GLDAS soil moisture. Both soil moisture products also show a similar temporal pattern that is relatively dry in spring (May and June), with a large dry area around the Gobi Desert, and relatively wet in summer (July and August), with a smaller one.
Daily 1 km downscaled and GLDAS soil moisture data during the growing season (May to September 2015; 153 days) were compared using r and RMSE (Figure 8). It should be noted that positive correlation (0.353 on average) and low RMSE (<0.06 m3·m−3) appear in most areas. Cloudy regions such as southern China and southern Japan showed lower r and higher RMSE than other regions due to the limited number of training samples. The northeastern part of the study region has negative correlation, which implies that the downscaling model was not able to capture the soil moisture pattern in the area. Unlike the other areas, soil moisture in this part has a very small dynamic range (~0.1 m3·m−3) during the growing season, which possibly resulted in low correlation coefficients when the downscaled soil moisture data was compared to GLDAS soil moisture information. Although the topographic and land cover characteristics of this area are similar to those in the northern part of North Korea, the soil moisture pattern differs somewhat between the two areas. In contrast to the sufficient number of training samples selected in North Korea to develop the downscaling model, the limited training samples from the northeastern part of the study region may also explain the negative correlation between the downscaled and the GLDAS soil moisture data.
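The per-pixel comparison behind Figure 8 can be sketched with NumPy as follows. The sketch assumes that both products are available as arrays of shape (days, rows, cols) on a common grid, that cloudy or missing days are stored as NaN, and that pixels with fewer than `min_days` valid pairs are masked. The function name, the `min_days` threshold, and the toy data are hypothetical and not taken from the study.

```python
import numpy as np


def pixelwise_r_rmse(downscaled, reference, min_days=10):
    """Pearson r and RMSE along the time axis for (days, rows, cols) arrays."""
    d = np.asarray(downscaled, dtype=float)
    g = np.asarray(reference, dtype=float)
    valid = np.isfinite(d) & np.isfinite(g)      # days usable in both products
    n = valid.sum(axis=0)                        # valid day count per pixel
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_d = np.where(valid, d, 0.0).sum(axis=0) / n
        mean_g = np.where(valid, g, 0.0).sum(axis=0) / n
        dev_d = np.where(valid, d - mean_d, 0.0)
        dev_g = np.where(valid, g - mean_g, 0.0)
        r = (dev_d * dev_g).sum(axis=0) / np.sqrt(
            (dev_d ** 2).sum(axis=0) * (dev_g ** 2).sum(axis=0)
        )
        rmse = np.sqrt(np.where(valid, (d - g) ** 2, 0.0).sum(axis=0) / n)
    too_few = n < min_days                       # e.g., persistently cloudy pixels
    r[too_few] = np.nan
    rmse[too_few] = np.nan
    return r, rmse


# Toy stack: 12 days on a 1 x 2 grid; the second pixel is always cloudy.
days = np.linspace(0.1, 0.3, 12)
toy_downscaled = np.stack([days, np.full(12, np.nan)], axis=1).reshape(12, 1, 2)
toy_reference = np.stack([days + 0.01, days], axis=1).reshape(12, 1, 2)
r_map, rmse_map = pixelwise_r_rmse(toy_downscaled, toy_reference)
```

A pixel with a very small dynamic range has small deviations in both the numerator and the denominator of r, so modest noise can change the sign of the correlation, which is consistent with the behavior described for the northeastern part of the study region.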

Novelty, Opportunities, and Limitations
This study developed a soil moisture downscaling model with operational use in mind. Although other machine learning approaches such as random forest may yield higher modeling accuracy in downscaling soil moisture [32], they require more computation (about 13 times more) to produce high-resolution soil moisture over a large area (e.g., East Asia). The computational cost is important for the operational use of a model. While it took 15 min to produce the 1 km downscaled soil moisture map over East Asia (9600 × 6000 grids) using the optimized regression tree, it took 3 h 25 min using random forest in a hardware environment of an Intel Core i7-4770 CPU @ 3.4 GHz (Hewlett Packard, Palo Alto, CA, USA) with MATLAB 2016b (Mathworks, Natick, MA, USA).
It is also difficult to interpret a random forest model, including its physical meaning and process, because it uses hundreds of trees. The optimized regression tree provides explicit rules and regression models, shows high performance, and produces soil moisture data faster than random forest. Figure 9 shows the spatial distributions of the daily downscaled soil moisture and TRMM precipitation from 10 to 16 July 2015. There were heavy rains between 11 and 12 July over the Korean Peninsula. The heavy rainfall caused soil moisture to increase from 11 July to 14 July (peak). Since there was no precipitation after 14 July, the soil moisture decreased (15-16 July). The daily downscaled soil moisture produced in this study reflects the changes in precipitation well. Therefore, the optimized regression tree is useful for producing valid soil moisture data at a high resolution.

There are some limitations in this study. Although this approach produced daily soil moisture data that matched well with both in situ and GLDAS soil moisture data, there are many no-data regions (e.g., southern China) due to cloud cover, especially during the wet season. Since our study period was from 2013 to 2015, we were unable to use other remote sensing data from recently launched satellites such as Global Precipitation Measurement (GPM) and SMAP. The use of a small number of input variables (i.e., six), chosen for the operational efficiency of the model, is another limitation.

Conclusions
This study aimed to develop a soil moisture downscaling model by optimizing a modified regression tree for operational use. The optimized regression tree, which consists of 59 rules and regression models, produces daily high-resolution (1 km) real-time soil moisture data in East Asia using MODIS, ASCAT, AMSR2, and SRTM products. The 1 km downscaled soil moisture showed high correlation and low RMSE when compared to GLDAS soil moisture. Ground soil moisture data at 14 stations was also used to assess the 1 km downscaled soil moisture. The 1 km downscaled soil moisture was moderately correlated with ground soil moisture and was closely related to the variations in precipitation. The spatiotemporal distributions of the 1 km downscaled soil moisture were also well matched with those of the GLDAS soil moisture data. This implies that the downscaled soil moisture may provide valuable information for applications involving agricultural and hydrological processes, such as drought monitoring, at various spatial scales.
Our study has some limitations that should be addressed in further research. Since some of the input parameters were from optical sensor data, there was a no-data problem due to clouds. This can be improved by adopting a hierarchical approach, i.e., applying another model that does not use optical sensor data to cloudy pixels. Similar to other empirical approaches, regression trees tend to result in a reduced dynamic range of a target variable (i.e., soil moisture in this study) toward the mean [32]. Another limitation is that the model did not use other recent remote sensing data such as GPM and SMAP due to the study period (from 2013 to 2015) and computational demand, considering the operational use of the proposed model. In future research, additional data will be incorporated to improve the performance of the high-resolution soil moisture model. Cumulative distribution function (CDF) matching will also be applied to the downscaled soil moisture data so that it has a dynamic range similar to that of GLDAS soil moisture. The proposed method will be tested for different regions such as Africa and Europe to examine the feasibility of applying the proposed approach to the production of global high-resolution soil moisture data.
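The CDF matching mentioned above as future work can be sketched as an empirical quantile mapping. This is a minimal illustration under stated assumptions, not the procedure the authors will necessarily adopt: the two distributions are summarized by 101 percentiles, values are mapped by linear interpolation between matching percentiles, and a single mapping is fitted to all samples (in practice the mapping might be fitted per pixel or per region). The function name `cdf_match` and the synthetic data are hypothetical.

```python
import numpy as np


def cdf_match(source, reference, n_quantiles=101):
    """Rescale `source` so its empirical CDF follows that of `reference`."""
    src = np.asarray(source, dtype=float)
    ref = np.asarray(reference, dtype=float)
    probs = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(src[np.isfinite(src)], probs)   # source percentiles
    ref_q = np.quantile(ref[np.isfinite(ref)], probs)   # reference percentiles
    out = np.full(src.shape, np.nan)
    ok = np.isfinite(src)
    # Each source value is replaced by the reference value at the same percentile.
    out[ok] = np.interp(src[ok], src_q, ref_q)
    return out


# Synthetic example: a compressed series is stretched to the reference range.
compressed = np.linspace(0.15, 0.25, 50)   # stand-in for downscaled soil moisture
wide = np.linspace(0.05, 0.35, 200)        # stand-in for GLDAS soil moisture
matched = cdf_match(compressed, wide)
```

Because the mapping is monotonic, the ranking of wet and dry days is preserved while the compressed dynamic range is expanded toward that of the reference.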

Figure 1. Study area of this research with land cover information and the location of ground stations that measure in situ soil moisture in South Korea. Refer to Table 2 for information on ground stations (a through n).

Figure 2. Data process flow diagram of the soil moisture downscaling model proposed in this study.

Figure 3. Performance of modified regression trees with different numbers of rules to identify the optimum number of rules.

Figure 4. Calibration (left) and validation (right) results of the soil moisture downscaling model proposed in this study.

Figure 5. Temporal pattern of GLDAS, downscaled, and in situ soil moisture data with precipitation information for each station (a-n) from May to September 2015.

Figure 7. Comparison of monthly spatial distribution between 1 km downscaled and GLDAS soil moisture data in 2015.

Figure 8. Spatial distribution of (a) correlation coefficients, (b) RMSE, (c) p-values, and (d) rRMSE values between GLDAS and the 1 km downscaled soil moisture data during the growing season in 2015.

Figure 9. Spatial distribution of the daily 1 km downscaled soil moisture and Tropical Rainfall Measuring Mission (TRMM) precipitation data around the Korean Peninsula from 10 to 16 July 2015.

Figure A1. Spatial distribution of the selected sampling points (i.e., 602 points) considering DEM, soil type, and land cover.

Table 1. Summary of remote sensing-derived independent variables and Global Land Data Assimilation System (GLDAS) soil moisture (dependent variable) to develop a machine learning-based soil moisture downscaling model. Shuttle Radar Topography Mission (SRTM) data was released in 2013.

Table 2. Geographical and land cover characteristics of 14 ground stations in South Korea.

Table 3. Six selected sub-models from the optimized regression tree-based downscaling model. Each sub-model consists of a rule and its associated multivariate regression model. The number of cases (i.e., samples) and the mean soil moisture corresponding to each rule are also presented.

Table 4. The usage in percentage of input variables in the rules and multivariate regression models produced from the optimized regression trees.

Table A2. The percentage of area and the selected sampling points (602 points) for each class in Digital Elevation Model (DEM), land cover, and soil type within the study area.