Performance Evaluation of the Multiple Quantile Regression Model for Estimating Spatial Soil Moisture after Filtering Soil Moisture Outliers

: The spatial distribution of soil moisture (SM) was estimated by a multiple quantile regression (MQR) model with Terra Moderate Resolution Imaging Spectroradiometer (MODIS) and filtered SM data from 2013 to 2015 in South Korea . For input data, observed precipitation and SM data were collected from the Korea Meteorological Administration and various institutions monitoring SM. To improve the work of a previous study, prior to the estimation of SM, outlier detection using the isolation forest (IF) algorithm was applied to the observed SM data. The original observed SM data resulted in IF_SM data following outlier detection. This study obtained an average data removal rate of 20.1% at 58 stations. For various reasons, such as instrumentation, environment, and random errors, the original observed SM data contained approximately 20% uncertain data. After outlier detection, this study performed a regression analysis by estimating land surface temperature quantiles. The soil characteristics were considered through reclassification into four soil types (clay, loam, silt, and sand), and the five-day antecedent precipitation was considered in order to estimate the regression coefficient of the MQR model. For all soil types, the coefficient of determination (R 2 ) and root mean square error (RMSE) values ranged from 0.25 to 0.77 and 1.86% to 12.21%, respectively. The MQR results showed a much better performance than that of the multiple linear regression (MLR) results, which yielded R 2 and RMSE values of 0.20 to 0.66 and 1.08% to 7.23%, respectively. As a further illustration of improvement, the box plots of the MQR SM were closer to those of the observed SM than those of the MLR SM. This result indicates that the cumulative distribution functions (CDF) of MQR SM matched the CDF of the observed SM. Thus, the MQR algorithm with outlier detection can overcome the limitations of the MLR algorithm by reducing both the bias and variance.


Introduction
To understand hydrological processes, including evapotranspiration, infiltration, percolation, and runoff, soil moisture (SM) is a key variable [1]. Therefore, understanding the spatial distribution of SM is crucial in analyzing hydrological processes [2]. In addition to using water resources research to study rainfall runoff, SM has been widely used in other specific fields, such as in agriculture to study plant growth and hydrometeorology to study interactions between the atmosphere and land

Materials and Methods
MQR was the main algorithm used in this study, and it is important because it can handle many types of data. In a previous study [5], input data, such as MODIS LST, NDVI, and precipitation, were selected for estimating SM using principal component analysis. A soil map for obtaining soil properties, wilting point, and field capacity was prepared from data provided by the Korea Rural Development Administration (KRDA). All spatial data, such as satellite data and soil maps, were prepared with a spatial resolution of 1 km, and observed data, such as precipitation and SM, were prepared with the same spatial resolution using the IDW technique [40].

MODIS Data
The MODIS data were prepared from the Land Processes Distributed Active Archive Center (LP DAAC, https://lpdaac.usgs.gov/) and EARTHDATA (https://earthdata.nasa.gov/), including MODIS LST (MOD11A1) and vegetation indices (MOD13Q1). The MOD11A1 provided daily per-pixel LST in Kelvin at a 1 km spatial resolution, and low-quality pixels, such as those with clouds and other atmospheric disturbances, were marked in an accompanying quality assessment (QA) layer. These pixels were corrected and reconstructed by the CM method. The advantage of this method is that it can preserve the spatial distribution, maintaining the precision of the observed data. In a previous study [5], high-precision LST data were estimated through the application of this method, and the corrected LST data were also used in this study. Please refer to the previous paper [5] for detailed methods and procedures for generating the corrected LST data.
The MODIS vegetation product (MOD13Q1) provides temporally and spatially continuous NDVI data with a 16-day interval at 250 m resolution from January 2013 to May 2015. Normally, when calculating daily SM through regression analysis, daily input data are required, but daily NDVI data were not available. Although vegetation changes vary by type of vegetation, vegetation generally becomes more vigorous from spring to summer and gradually fades from autumn to winter. In addition, rapid changes are not common in forests. Therefore, in this study, NDVI data with a 16day interval were applied as the vegetation data for daily SM estimation, and the spatial resolution was resampled to 1000 m, which is the same as the LST data.

Observed Data
In South Korea, the Korea Meteorological Administration (KMA) is constructing a high-density ground observation network through the establishment of 687 automatic weather stations (AWSs, Figure 2a). AWSs monitor weather data, such as precipitation, wind speed, and humidity, on the order of minutes, which produce data with a much higher accuracy than satellite data. The precipitation data acquired from AWSs were interpolated to a 1 km spatial resolution, the same as the MODIS data. The SM observation data were obtained from 58 stations run by various institutions (Table 1). Stations 1 to 9 were from the Automated Agriculture Observing System (AAOS) of the KMA. Stations 10 and 11 were from the Korea Institute of Hydrological Survey (KIHS), and stations 13 to 18 were from K-water. The other stations were from the Rural Development Administration (RDA). The soil map, which includes information on the field capacity, wilting point, and soil types, was essential for estimating SM. The SM map supplied by the RDA was sorted into 12 classes according to the U.S. Department of Agriculture (USDA) textural classification. However, due to the limitation of insufficient data with no minimum data for estimating the regression coefficient according to the 12 soil types, the soil types were reclassified into silt, clay, loam, and sand, based on the soil textural triangle. Sand represents sand, sandy loam, and loamy sand in the triangle. Likewise, clay represents silty clay, sandy clay, silty clay loam, and clay, and loam represents loam, clay loam, and sandy clay loam. Finally, silt represents silty loam and silt [16]. Figure 2b show the soil information at the SM stations. Of the 58 SM stations, clay accounts for approximately 48% (28 stations), and loam accounts for approximately 24% (14 stations). Therefore, considerable data are available for two soil types (clay and loam). Furthermore, in this study, the IF algorithm, which eliminates outliers, was applied to solve the uncertainty that can occur while reclassifying the soil texture into four classes. This process represents a quality control (QC) process for original SM data to which no QC processes have been applied.

Anomaly Detection Algorithm
Outlier detection, or anomaly detection, is a method to find the patterns in datasets that do not match expected patterns that differ significantly from it. There are a variety of methods such as isolation-based method, modal-based method, density-based method, and distance-based method. Among them, IF is an effective technique using a machine learning algorithm based on binary tree structures with a random sampling method that provides an ensemble of a series of trees from multidimensional training and testing data sets. Compared to other anomaly detection algorithms, the reasons for adopting IF in this study are as follows: (1) building iTrees is relatively straightforward, as users only need to randomly select a subset of the training sets; (2) it takes less time to calculate since it does not measure distance or density; (3) low memory requirements; and (4) the ensemble algorithm can overcome the low efficiency of iTrees [41]. The basic concept of IF is that the few anomalous data far from the normal cluster center can be identified through anomaly detection [42]. The IF technique consists of a two-stage procedure. A training step structures basic isolation trees that build various subsamples using random sampling from the training set. The testing step calculates the length of the path by passing samples to obtain an anomaly score through isolation trees ( Figure 3). To isolate every subsample and stage, a tree structure can be used effectively because there are few anomalies far from normal points. While normal samples can be far from the root, anomalies are closer to the root of the tree. Between the minimum and maximum values of the attribute, partitions of every IF structure are selected at random, and automatically recursive partitions are passed. A randomly selected partition is calculated, and each tree is classified by dividing different structures. Finally, each path length is calculated to determine an outlier score. The definition of the anomaly score for instance x is: Where ( ) can be estimated by ln( ) + 0.5772156649 (Euler's constant) as the harmonic number since c(n) is the constant value to normalize the average path length for n trees. is the number of nodes ( ). ℎ( ) is the path length of sample by the number of edges x traverses and iTrees from the root node until the traversal is terminated at an external node. (ℎ( )) is the average path length of each ℎ( ) from a collection of iTrees. is the anomaly score used in the following evaluation. The evaluation includes the following processes: (a) if is very close to 1, then it is clearly an anomaly; and (b) if s is much smaller than 0.5, then it may truly be a normal point. For instance, when s is 1, (ℎ( )) will be zero (0). This means that all the path length for all n trees get close to the root node.
In this study, the IF structure consisted of the sklearn-ensemble library in Python.
To confirm that the IF outlier detection technique really removed the uncertainty of the SM data, this study suggested a data removal rate (DRR) and the percentage of matching SM increases with increasing precipitation (PCP), which is COR_PCP, to assess tendency showing SM increases with increasing PCP at the same time. Two indicators are as following Equations (3)

Multiple Quantile Regression Model
It is possible to estimate the conditional quartile by considering various quantiles for the dependent variable as the parametric method. In addition, this method can be applied to a case where the distribution of the given data is large or heterogeneous [43]. In addition, since the influence of dependent variables according to independent variables can be estimated in various quartiles, it is possible to perform regression analysis considering the distribution characteristics of time series so that it is not only superior in methodology but also easy to expand to nonlinear models [44].
Quantile regression analysis can directly evaluate LST trends as a method for determining the linear and nonlinear trends of a particular quantile (r) in the overall data. This trend is expressed by the following equation [45]: where i denotes the i-th value among the n data, i = 1, 2, …, n. As seen from the above equation, the regression analysis of the min-squared regression assesses the tendency to minimize the sum of the errors of the weights multiplied by the weight line r based on the trend line of the r values. The method of finding the trend line is similar to the least-squares method, but it differs in that it uses the sum of absolute values instead of the sum of squared errors. In this way, when the absolute value is used instead of the square of the error, the effect of the outliers is less reflected in the process of obtaining the trend equation, so that it is possible to reduce the effect of excessively increasing or decreasing the trend due to the outliers. MODIS LST, NDVI, and precipitation, including antecedent precipitation from one to five days, were used to develop the MQR model as input data, and regression coefficients and equations were estimated seasonally, which were divided into spring, summer, autumn, and winter. Jung et al. [5] described the process of regression coefficients for suitability of the coefficients, such as p-value and multicollinearity. To predict the spatial distribution of SM, the MQR model was performed by using the LST variable quintiles, including 0.05, 0.

Outlier Detection Of Observed SM Data
To assess tendency showing SM increases with increasing PCP at the same time, the IF developed two algorithms. The first one is the algorithm (IF1) using only observed SM data, and the other one is the algorithm (IF2) using both observed SM data and PCP as an independent variable. Finally, the observed SM data, after applying the IF algorithm for outlier detection (IF_SM), were used as the target variable for quantile regression analysis. Table 2 shows DRR and COR_PCP at 58 stations. From that result, the average DRR for IF1 and IF2 was 23.6% and 16.0%, respectively. However, the result of IF1 showed that most of the original SM data had approximately 28.8% uncertain data, whereas the result of IF2 showed a variety of removal efficiency. This would come from considering PCP trends. To confirm the basic idea based on increasing SM with increasing PCP, this study applied for IF2 algorithm and compared that using COR_PCP. The COR_PCP showed 26.8% for IF1 and 35.2% for IF2. The results in the IF2 algorithm improved tendency to increasing PCP and SM by about 8.4%. Finally, this study selected the results of the IF2 algorithm. In Figure 4, the raw data (original observed SM) and the isolation forest considering PCP (IF_SM) data were compared at major stations and are illustrated in Figure 4, represented by different marks such as blue circles and red Xs.   IF1  IF2  IF1  IF2  IF1  IF2  IF1  IF2  IF1  IF2  IF1 Table 3 shows specific optimal regression coefficients for the 10%, 50%, and 90% quantiles. As mentioned above, quantile regression was analyzed with a total of 19 quantiles of 0.05 intervals from 0.05 to 0.95. The coefficient of determination (R 2 ) was calculated and presented to confirm the results of the MQR model. The R 2 shows a value from 0 to 1, and the higher value, the less error variance [46]. Overall, the R 2 ranged from 0.38 to 0.82 and the average R 2 of 0.61 in clay was much better than those of the other soil types. Loam had an average R 2 of 0.42. Notably, the R 2 values for clay in spring and summer were 0.76 and 0.55, respectively. The reason why R 2 of clay was low in summer was considered to be due to the climatic characteristics of South Korea associated with the monsoon season. Every year from June to July, there is a rainy season known as Jangma, in which heavy rainfall is concentrated, and it may cause some places to flood. The uncertainty of the soil moisture variation pattern is largely due to the rainy season in the summer, and the predicted accuracy decreases accordingly. On the other hand, in the spring, there is relatively little rainfall, so the pattern of soil moisture change is monotonous and seems to have a high correlation. In silt and sand, the average R 2 values were 0.40 and 0.39, respectively. In particular, R 2 was shown to be low in winter, and it is possible that there was an instrument error in the observed value because the soil was frozen in winter. Compared to the previous study [5] using the MLR model, there was no significant improvement in R 2 values of less than 0.5. The reason was determined to be that the classification of soil properties was not perfect. The observed SM data provided by the RDA showed that the observation period was only approximately one year, so that it was not stabilized, and irregular changes appeared. In addition, it was judged that the accuracy was further reduced by reclassifying the soil properties into four categories.

Performance Comparison between The MLR And MQR Models
Based on the estimated regression coefficients, SM was calculated for each LST quantile ranging from 10% to 90% and compared with the observed SM. The verification results were shown using R 2 , root mean square error (RMSE) and index of agreement (IOA) to show the extent to which the results were better than those of the previous study using MLR (Table 4) [5]. The RMSE means that the error of the model is less as it approaches 0, and the IOA ranges from 0 and 1, indicating better efficiency when the value closer to 1 [46]. Figure 5 shows the time series changes in the observed SM and calculated SM through the MLR and MQR models. These representative stations in Figure 5 were recommended by a previous paper from Jung et al. [5]. From that paper, those selected were defined by considering physical characteristics, which were field capacity (FC) and wilting point (WP), to each soil type.
The SM calculated through the MLR model showed R 2 values from 0.2 to 0.66 for the four soil types, and the average R 2 was 0.37. The RMSE ranged from 1.86% to 12.21%, and the average RMSE was 4.15%. In contrast, the R 2 and RMSE values for the MQR results ranged from 0.25 to 0.77, with an average of 0.50, and from 1.08% to 7.23%, with an average of 3.04%, respectively. While the average IOA of SM estimated by MLR was 0.54, and ranged from 0.17 to 0.88, the average IOA by MQR was 0.68, and showed a value from 0.3 to 0.87. From these results, the MQR results showed much better performance than the MLR results. The average R 2 , RMSE, and IOA improved by 0.13, 1.1%, and 0.14, respectively. These improvements came from removing uncertainty from measurement errors due to freezing and mechanical errors by IF and the advanced regression algorithm. However, because the soil map provided in this study consisted of four types, the SM prediction of the general MLR study caused these errors. Therefore, the MQR algorithm overcame the limitation by estimating the various regression equations under detailed conditions, such as season, soil types, and LST quantiles. Table 4. Comparison of the statistical analysis results between the multiple linear regression (MLR) and MQR models at 58 SM stations.

Station
No. In Figure 5, the MLR results were reevaluated based on the method and information in Jung et al. [5]. Notably, each box plot shows that the MQR SM is closer to observed SM than the MLR SM. At the 32 gauging sites in clay, the first quartile (Q1) values of the observed SM, MLR SM, and MQR SM were 28.6, 25.3, and 28.4, respectively. At these stations, Q1 by MQR showed a significant improvement. The absolute percent errors for Q1 of the MLR and MQR were 34.8% and 14.2%, respectively. The MQR result was better than that of the MLR result by 20.6%. At the 14 gauging sites in loam, the Q1 values of the observed SM, MLR SM, and MQR SM were 14.6, 21.9, and 16.9, respectively. The third quartile (Q3) values of the observed SM, MLR SM, and MQR SM were 23.1, 25.7, and 24.0, respectively. At these stations, Q1 and Q3 of the MQR showed significant improvements. The Q1 absolute percent errors of MLR and MQR were 49.8% and 15.6%, respectively. The Q3 absolute percent errors of MLR and MQR were 11.1% and 3.9%, respectively. The MQR produced improvements of 34.2% for Q1 and 7.2% for Q3 over the MLR results. At the 42 gauging sites in silt, the Q1 values of the observed SM, MLR SM, and MQR SM were 14.0, 19.9, and 15.9, respectively. At these stations, Q1 of the MQR showed significant improvements. The Q1 absolute percent errors of MLR and MQR were 41.5% and 13.4%, respectively. The MQR result was 28.1% better than that of the MLR result. At the seven gauging sites in sand, the Q1 values of the observed SM, MLR SM, and MQR SM were 11.5, 15.6, and 13.1, respectively. At these stations, Q1 of the MQR showed significant improvements. The Q1 absolute percent errors of MLR and MQR were 34.8% and 14.2%, respectively. The MQR result was 20.6% better than the MLR result.

Limitation of the MQR model
Although the model was improved, there were some results that show poor accuracy. They could have been caused by the non-standardized of the algorithm and the limitation of observed data. Moreover, ignoring other variables that might impact on the estimation of soil moisture can reduce model efficiency for the prediction of soil moisture. This study did not consider elevation and slope as geophysical features; however, these variables are factors that can explain water flows under the land surface. Of these results, SW, JJ3, CC4, TG2, and HH6, which showed low performance of the model, have low elevation and slope at the same time. The elevations were 40 m (SW), 12 m (JJ3), 9 m (CC4), 11 m (TG2), and 3 m (HH6), respectively. The slopes of these stations were all about 0%, respectively.
To simply go over in terms of these effects, SM variation by PCP as natural inflow for these five stations was analyzed. As seen in Table 5, these results show that average daily SM, when PCP was less than 5 mm/day (dry SM), was slightly bigger than SM when PCP was over 5 mm/day (dry SM), which means that this trend was unlike normal SM tendency. Moreover, dry SM at the SW station increased an average 3.1% compared to wet SM. The reason for this tendency is that these five stations are in the area around the river or relatively close to the river than the other 53 stations. In this area, the interaction between surface water and groundwater occurs actively. Thus, it would cause a strong dynamic movement of soil moisture. This result is a fragmentary analysis, which is difficult to generalize, but it is necessary to consider those variables as a further review.

Extension of input variables
In some papers for estimating SM, various factors applied such as albedo, brightness, greenness, wetness, NDVI, normalized difference water index (NDWI), normalized difference built-up index (NDBI), elevation, slope, and aspect [30]. It is thought that adding these variables will improve SM prediction performance, but it has not been applied in this study. This is because the purpose of this study was to evaluate how much performance could be improved by the MQR model compared to that of the MLR model in the previous study [5], after filtering SM outliers. In addition, it could be expected to improve the simulation performance by applying the temperature vegetation dryness index (TVDI) [25], considering vegetation (NDVI) and land surface temperature (LST).
Even though this study did not consider the additional variables, we proceeded to improve the existing simple algorithm based on the satellite image LST data. This study achieved meaningful results; however, it is still necessary to consider the areas that did not improve significantly. Although the accuracy of the model could be improved by adding new variables, as the variables become more complex, multicollinearity between variables and overfitting of untrained variables would increase. This can lead to significant side effects for the non-verified period.
Nevertheless, as mentioned in Section 4.1, considering the hydrological system of soil moisture, we found out that the elevation, slope, and distance from the stream will have an additional effect. Based on surface and groundwater flows, it could be determined that the elevation and slope values would increase the movement of moisture due to slope in the soil. Furthermore, the soil moisture in the area close to the river may be sensitive to the influence of groundwater in addition to precipitation and soil characteristics. Therefore, if sufficient data are available, it is expected that future studies will greatly improve this algorithm.

Conclusions
This study aimed to improve the original MLR algorithm for the indirect measurement of spatial SM. To improve the original algorithm, this study first performed outlier detection of observed SM using the IF method, which is a type of machine learning, and the spatial distribution of SM was estimated using an MQR model from 2013 to 2015. As input for the MQR algorithm, MODIS LST, MODIS NDVI, precipitation, and the soil type were used as independent variables, with consideration for the environmental attributes of SM. Because of the limitation of insufficient data, the soil type was reclassified into four soil classes: silt, clay, loam, and sand. For this reason, the soil information at 58 stations was not uniformly distributed, and certain soils were more common. Therefore, this study had to classify four soil types. The primary results are summarized as follows: 1. As a result of outlier detection, the average DRRs for IF1 and IF2 were 23.6% and 14.4%, respectively, at 58 stations. In addition, average COR_PCP for IF1 and IF2 were 29.9% and 37.6%, respectively. The result of IF2 shows that the IF algorithm considering PCP (precipitation) can improve suitability of the outlier detection. Finally, the IF2 result was used as an input variable. 2. When comparing the MLR and MQR results, the R 2 and RMSE values for MLR were 0.20 to 0.66 and 1.86% to 12.21%/day, respectively, while the R 2 and RMSE values for MQR were 0.25 to 0.77 and 1.08% and 7.23%/day, respectively. From these results, the R 2 improved by 0.13 from an average of 0.38 to 0.50, and the RMSE decreased by 1.1%/day errors from an average of 4.15% to 3.05%/day. 3. Finally, in addition to improvement in accuracy, box plots were constructed for the four major stations representing each of the soil types to match the cumulative distribution functions (CDF) between observed SM and estimated SM, including MLR and MQR. At these stations, Q1 and Q3 of the MQR showed significant improvements. The Q1 and Q3 absolute percent errors for the MQR improved by 25.9% and 5.2%, respectively. The method of the indirect measurement of spatial SM using MODIS LST, NDVI, and antecedent precipitation from the previous study [5] was verified. Additionally, MODIS LST corrected by the CM technique ensured the reliability of the data. Compared to previous results from the MLR model, improvements were seen not only in the R 2 of the MQR model, showing a 62% (0.37 to 0.50) improvement, but also in the distribution of the MQR, such that the CDF was close to the distribution of observed SM. This method overcame the limitations of the previous model by improving both the bias and variance. Nevertheless, since there were not enough data spanning more than two years at most stations, all data spanning less than two years were used to train the MQR model. Therefore, this study could have an overfitting problem for prediction. Future research could resolve this issue by obtaining more than two years' worth of observed SM data and by splitting the acquired data into training and verification subsets.