Short-Term Soil Moisture Forecasting via Gaussian Process Regression with Sample Selection

Soil moisture is a critical limiting factor for crop growth. Accurate soil moisture prediction helps to schedule irrigation and improve crop production. A soil moisture prediction method based on Gaussian Process Regression (GPR) is proposed in this paper. To reduce the computation time of the GPR model, the Radially Uniform (RU) design algorithm is incorporated into the sample selection during the training procedure, so that representative training samples are identified and less training time is required. To validate the proposed prediction model, soil moisture data collected in Beijing, China, were used. The experimental results demonstrate that the GPR model with the RU design algorithm generally outperforms the generic GPR model, yielding smaller errors for both deterministic and probabilistic forecasting while requiring less computing time for model training.


Introduction
Within the hydrological cycle, the structure and function of the natural hydrological system depend on soil moisture [1][2][3][4]. Soil moisture governs the distribution of mass and energy flux between land and atmosphere. Thus, soil moisture is a vital indicator for evaluating water and energy balance [5].
Drought is one of the main natural disasters for agriculture all over the world. Unlike other natural disasters, it is characterized by frequent occurrence, wide coverage, and long duration. As a consequence, agricultural production is significantly influenced by drought. Soil moisture is an important variable for drought assessment and forecasting [6][7][8], as well as for flood and landslide simulation and prediction [9][10][11][12]. By analyzing and predicting changes in soil moisture, it is possible to anticipate potential droughts and schedule irrigation [13]. Besides predicting droughts, soil moisture forecasting is also of great importance to flood management. Soil moisture in the functional landscape describes the initial conditions of the watershed well and may provide valuable information for flood forecasting and early warning systems [14].
In addition to preventing drought and flood, soil moisture is also closely linked to agricultural production, since it indicates the water content available in the soil for plant growth.

Prediction Model
To improve the forecasting performance of the GPR model, the training dataset is processed so that only the most representative subset is retained. The RU design algorithm is employed to sample this subset from the full dataset.

Gaussian Process Regression
A Gaussian process (GP) is a data-driven model grounded in Bayesian theory and statistics. It is well suited to complex regression problems that are high-dimensional, small-sample, or highly non-linear, and it offers strong generalization performance. Compared with neural networks and support vector machines, GP has several advantages, including easy implementation, adaptive acquisition of hyper-parameters, the flexibility of non-parametric inference, and probabilistic outputs.
A GP is a set of random variables such that any finite linear combination of them follows a normal distribution; equivalently, every finite-dimensional distribution is a joint normal distribution [28]. A GP is completely specified by its mean and covariance functions and shares many properties with the Gaussian distribution [29]. Given a dataset of N groups, D = {(Xᵢ, yᵢ), i = 1, 2, …, N}, where Xᵢ is the input and yᵢ is the noisy output, the noise is assumed, for the simplest case, to be independent, normal, and additive, so that the relation between the latent function f(X) and the observed noisy target y is

y = f(X) + ε, ε ∼ N(0, σ²_noise), (1)

where ε is the noise and σ²_noise is the variance of the noise. Given a real-valued random process with mean function µ(X) and covariance function k(X, X′), Equations (2) and (3) hold:

µ(X) = E[f(X)], (2)

k(X, X′) = E[(f(X) − µ(X))(f(X′) − µ(X′))]. (3)
The random process f(X) is then a Gaussian process, which can be written as

f(X) ∼ GP(µ(X), k(X, X′)). (4)

In general, µ(X) is set to 0 and Equation (4) can be rewritten as

f(X) ∼ GP(0, k(X, X′)), (5)

where X is the learning sample, whose measure in the GP is the finite-dimensional distribution of the GP. As defined by the GP, this finite-dimensional distribution is a joint normal distribution:

f ∼ N(0, k(X, X)). (6)

The predictive distribution of the function values f* is computed at the test locations X*. The probability distribution of f* is described in Equation (7):

f* ∼ N(0, k(X*, X*)). (7)
If the effect of noise is taken into account, the probability distribution of the observed noisy target y is shown in Equation (8):

y ∼ N(0, k(X, X) + σ²_noise I). (8)

The joint probability distribution of y and f* is depicted in Equation (9):

[y; f*] ∼ N(0, [[k(X, X) + σ²_noise I, k(X, X*)], [k(X*, X), k(X*, X*)]]). (9)

Using the conditional distribution properties of the Gaussian distribution, the posterior is obtained in Equation (10):

f* | X, y, X* ∼ N(µ̄, Σ̄),
µ̄ = k(X*, X)[k(X, X) + σ²_noise I]⁻¹ y,
Σ̄ = k(X*, X*) − k(X*, X)[k(X, X) + σ²_noise I]⁻¹ k(X, X*). (10)
The above formulas give the prediction form of GPR, i.e., the finite-dimensional distribution of the test sample under the test-space posterior. The prediction of the response corresponding to X* is taken to be

y* = µ̄. (11)

The covariance matrix K has entries given by the covariance function, K_{i,j} = k(Xᵢ, Xⱼ). The covariance matrix is required to be positive semi-definite; since kernel functions yield positive semi-definite matrices, the covariance matrix can be obtained from a kernel function. The Radial Basis Function (RBF) kernel [30] is employed in this paper.
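The posterior equations referenced above (Equations (10) and (11)) can be sketched directly in NumPy. This is a minimal illustration only: the RBF length-scale, the noise variance, and the toy data are assumptions for demonstration, not values from this paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gpr_predict(X, y, X_star, noise_var=0.05 ** 2, length_scale=1.0):
    """Posterior mean and variance as in Equations (10)-(11):
    mu  = k(X*, X) [k(X, X) + sigma^2 I]^-1 y
    cov = k(X*, X*) - k(X*, X) [k(X, X) + sigma^2 I]^-1 k(X, X*)."""
    K = rbf_kernel(X, X, length_scale) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X_star, X, length_scale)
    mu = K_s @ np.linalg.solve(K, y)
    cov = rbf_kernel(X_star, X_star, length_scale) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, np.diag(cov)

# Toy stand-in for the soil moisture task: 8-dimensional inputs, scalar target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 8))
y = np.sin(X.sum(axis=1))
mu, var = gpr_predict(X, y, X[:5])
```

The O(N³) linear solve on the N×N kernel matrix K is exactly why the training set size matters, motivating the RU sample selection described below.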
Since a large dataset increases the kernel size and therefore significantly increases the computation time, controlling the size of the dataset helps to run the GPR algorithm in a reasonable time. In this study, RU design was employed to select the most representative subset from the large dataset, so that a new training dataset is obtained and the kernel size is kept under control. Meanwhile, because the selected subset concentrates the characteristics of the previous complete dataset, the prediction accuracy is hardly affected. Therefore, RU design can improve the prediction performance of the GPR model for large-dataset problems.

Description of RU Design
Suppose that we have T = {(x₁, Y₁), …, (x_N, Y_N)}, where N is the size of a big dataset and each xᵢ is a normalized vector of d-dimensional inputs. If N is very large, the training time of a GPR model built with the full training dataset will be too long. Thus, a representative subset (the chosen data) D = {(x₁, Y₁), …, (x_n, Y_n)} is generated, which can greatly improve the model training speed without losing accuracy. If D is randomly chosen from T, the chosen data may not be representative, and the prediction accuracy suffers. Hence the chosen data D ought to be representative of T, with a design point near each point of the full training data T.
The RU algorithm, proposed in [31], seeks to maintain the density of x₁, …, x_N in the radial direction while minimizing the difference between the distribution of the response values in the full training data and that in the chosen data.
The chosen data are obtained by minimizing a criterion ϕ subject to the following condition: the radial distribution of the chosen data,

F(r) = (1/n) Σᵢ₌₁ⁿ I(‖xᵢ − m‖₂ ≤ r),

is adequately close to the radial distribution of the full training data,

G(r) = (1/N) Σᵢ₌₁ᴺ I(‖xᵢ − m‖₂ ≤ r),

where m is the median of the training data. The criterion ϕ evaluates the difference between the distributions of the response points in the chosen data and in the full training data.
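The closeness condition above can be checked empirically. The sketch below computes the radial distributions F(r) and G(r) and their maximum gap over a grid of radii; the function names and the max-gap measure are our own simple proxy, since the exact criterion ϕ is not fully specified in this excerpt.

```python
import numpy as np

def radial_cdf(X, m, radii):
    """Empirical radial distribution: fraction of points whose distance
    to the median m is at most r, evaluated at each r in radii."""
    d = np.linalg.norm(X - m, axis=1)
    return np.array([np.mean(d <= r) for r in radii])

def radial_discrepancy(chosen, full, num_r=50):
    """Maximum gap between F(r) of the chosen data and G(r) of the full
    data over a grid of radii -- a simple proxy for the closeness condition."""
    m = np.median(full, axis=0)                       # median of the training data
    d_full = np.linalg.norm(full - m, axis=1)
    radii = np.linspace(0.0, d_full.max(), num_r)
    gap = np.abs(radial_cdf(chosen, m, radii) - radial_cdf(full, m, radii))
    return float(np.max(gap))
```

A representative subset should yield a small discrepancy; a random subset may not.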

Procedure of RU Design
(1) Sort x₁, …, x_N in ascending order of their distances from the median, ‖xᵢ − m‖₂, i = 1, …, N, so that xᵢ is the point with the i-th smallest distance to the median; i.e., x₁ is the point closest to the median.
(2) Set (k₁, …, k_N) = (⌈n/N⌉, ⌈2n/N⌉, …, ⌈Nn/N⌉), where ⌈in/N⌉ denotes in/N rounded up. Then divide the N sorted training data into n teams, where xᵢ is allocated to team kᵢ.
(3) Select one design point at random from each team to constitute the initial chosen data D₀ = {x₁, …, x_n}, where x_j is drawn from team j. Then set t = 1.
Otherwise, set t = t + 1 and return to step (4). In RU design, x₁, …, x_N are always standardized so that xᵢ ∈ [0, 1]ᵈ. If the chosen data and the full training data meet condition (13), the radial distribution of the chosen data F(r) approximates the radial distribution of the full training data G(r).
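Steps (1)-(3) of the procedure above can be sketched as follows. The iterative exchange step that refines the design by minimizing ϕ is omitted, since it is not fully specified in this excerpt; the function name is our own.

```python
import numpy as np

def ru_initial_design(X, n, rng=None):
    """Sketch of steps (1)-(3) of RU design: sort points by distance to the
    coordinate-wise median, split them into n teams, and draw one point at
    random from each team.  Returns the indices of the chosen points."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    N = len(X)
    m = np.median(X, axis=0)
    # Step (1): rank points by distance to the median ||x_i - m||_2.
    order = np.argsort(np.linalg.norm(X - m, axis=1))
    # Step (2): team index k_i = ceil(i * n / N) for rank i = 1..N.
    teams = np.ceil(np.arange(1, N + 1) * n / N).astype(int)
    # Step (3): one random representative per team.
    chosen = [rng.choice(order[teams == j]) for j in range(1, n + 1)]
    return np.array(chosen)

subset = ru_initial_design(np.random.default_rng(1).uniform(size=(40, 3)), n=8, rng=2)
```

Because the teams partition the sorted points, the subset preserves the radial density of the full data by construction.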

Choice of the Optimization Standard
Given G different responses, the proposed criterion value ϕ(D) is computed in Equation (14).
ϕ(D) evaluates the representativeness of the response points chosen by the RU design algorithm. In this paper, x₁, …, x_N denote N soil moisture time series, each consisting of 8 consecutive days of data. Y(1), Y(2), and Y(3) indicate the daily average soil moisture on day 9, day 10, and day 11, respectively.

Prediction Interval Formulation
Suppose a collection of observed points {(xᵢ, tᵢ)}ᵢ₌₁ᴺ. The forecasting target can be presented as

tᵢ = g(xᵢ) + ε(xᵢ),

where tᵢ is the i-th forecasting target corresponding to the input vector xᵢ, ε(xᵢ) is noise with zero mean, and g(xᵢ) is the true regression mean. The GPR prediction ĝ(xᵢ) is an estimate of the true regression value g(xᵢ). If the estimation error and the noise are statistically independent, the variance of the total prediction error σ̂²_t(xᵢ) can be obtained from the variance of the model uncertainty σ̂²_g(xᵢ) and the variance of the noise σ̂²_ε(xᵢ):

σ̂²_t(xᵢ) = σ̂²_g(xᵢ) + σ̂²_ε(xᵢ).

The 100(1 − α)% prediction interval (PI) for tᵢ is then

[ĝ(xᵢ) − z₁₋α/₂ σ̂_t(xᵢ), ĝ(xᵢ) + z₁₋α/₂ σ̂_t(xᵢ)],

where z₁₋α/₂ is the critical value of the standard Gaussian distribution, depending on the prospective confidence level 100(1 − α)%, and α is the significance level. The future target tᵢ is expected to lie in the constructed PI with the nominal probability 100(1 − α)%.

Performance Assessment

Point Prediction Evaluation
To evaluate the accuracy of the point prediction results, three assessment metrics, root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), are utilized to compute the prediction errors of the proposed model. They are listed in Equations (15)-(17), respectively:

RMSE = √((1/M) Σᵢ₌₁ᴹ (yᵢ − ŷᵢ)²), (15)

MAE = (1/M) Σᵢ₌₁ᴹ |yᵢ − ŷᵢ|, (16)

MAPE = (100%/M) Σᵢ₌₁ᴹ |(yᵢ − ŷᵢ)/yᵢ|, (17)

where yᵢ is the actual value of the soil moisture, ŷᵢ is the predicted value of the soil moisture, ȳ is the mean of the actual values over the test dataset, and M is the size of the test dataset.
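The three point-prediction metrics can be sketched in a few lines of NumPy; the function names are ours.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error, Equation (15)."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean absolute error, Equation (16)."""
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    """Mean absolute percentage error (in percent), Equation (17).
    Assumes the actual values y are non-zero."""
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)
```

Note that MAPE is undefined when an actual value is zero, which is rarely an issue for soil moisture data.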

Reliability
Reliability is a significant attribute for assessing the accuracy of the approximated PIs. According to the introduction of the PI above, the future target tᵢ is supposed to be covered by the constructed interval with the nominal probability 100(1 − α)%, known as the PI nominal confidence (PINC). Given N test data points, the PI coverage probability (PICP) indicates the empirical coverage probability of the corresponding PIs, defined in Equations (23) and (24):

PICP = (1/N) Σᵢ₌₁ᴺ cᵢ, (23)

where cᵢ = 1 if tᵢ lies in the i-th PI, and cᵢ = 0 otherwise. (24)
The closer the obtained PICP is to the corresponding PINC, the higher the reliability of the PIs. The average coverage error (ACE), defined as

ACE = PICP − PINC, (25)

can therefore evaluate the reliability of PIs. As reliability increases, the ACE approaches zero.

Sharpness
Increasing the width of PIs can easily improve their reliability, but overly wide intervals are of little use in practical applications. The reliability index PICP is therefore considered together with sharpness, which is applied to evaluate the quality of PIs in this study.
The width of the PI for target tᵢ, δ^(α)(xᵢ) = Uᵢ − Lᵢ, where Lᵢ and Uᵢ are the lower and upper bounds of the PI, is calculated according to Equation (26). The interval score assesses the overall skill of the soil moisture PIs and reflects the sharpness perspective in Equation (27):

S^(α)(xᵢ) = −2αδ^(α)(xᵢ) − 4(Lᵢ − tᵢ), if tᵢ < Lᵢ,
S^(α)(xᵢ) = −2αδ^(α)(xᵢ), if Lᵢ ≤ tᵢ ≤ Uᵢ, (27)
S^(α)(xᵢ) = −2αδ^(α)(xᵢ) − 4(tᵢ − Uᵢ), if tᵢ > Uᵢ.

The overall score value Ω^(α) is computed as the average of S^(α)(xᵢ) over the test set. The interval score rewards narrow PIs and penalizes cases where the target falls outside the PI. Thus, it takes both reliability and sharpness into consideration and can play a significant role in evaluating the overall skill of PIs. When the reliabilities of PIs are similar, a larger score corresponds to higher sharpness and overall skill [32].
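The PI construction and the three evaluation quantities above can be sketched as follows. The function names are ours, and the interval score uses the common Winkler-style form consistent with the piecewise description above.

```python
import numpy as np
from statistics import NormalDist

def prediction_interval(mu, std, alpha=0.1):
    """100(1-alpha)% PI around the predictive mean, using the total
    predictive standard deviation (model uncertainty plus noise)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{1-alpha/2}
    return mu - z * std, mu + z * std

def picp(t, lower, upper):
    """PI coverage probability: fraction of targets inside their PIs."""
    return float(np.mean((t >= lower) & (t <= upper)))

def ace(t, lower, upper, alpha):
    """Average coverage error: PICP minus the nominal coverage PINC."""
    return picp(t, lower, upper) - (1 - alpha)

def interval_score(t, lower, upper, alpha):
    """Winkler-style interval score: rewards narrow PIs and penalizes
    targets falling outside; larger (closer to zero) is better."""
    width = upper - lower
    s = -2 * alpha * width
    s = s - 4 * np.where(t < lower, lower - t, 0.0)
    s = s - 4 * np.where(t > upper, t - upper, 0.0)
    return float(np.mean(s))
```

With these helpers, comparing models at a fixed PINC reduces to comparing their ACE magnitudes and overall scores.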

Data Description
In this paper, the soil moisture data were collected from 28 February 2012 to 8 November 2016 in Beijing, China. The collected original daily average soil moisture data were used to form the soil moisture time series serving as training and test datasets for analysis and prediction. The collected soil moisture data are plotted in Figure 1. During the experiment, 8 consecutive historical soil moisture data points and their corresponding response points (the 9th, 10th, and 11th data points) form a data item. In other words, 8 consecutive days of soil moisture are used to forecast the soil moisture of the following 3 days. The full dataset consists of 1272 data items. The first 1000 data items make up the complete training set, and the remaining 272 data items form the test set.
Y(1), Y(2), and Y(3) indicate the daily average soil moisture of the 9th, 10th, 11th days, respectively. The full training dataset makes it very time-consuming to develop the GPR model, and thus the RU algorithm is applied to sample the representative dataset from the large full dataset. The sample sizes of 100 and 500 are considered for the sample selection to examine the performance of the proposed method. Finally, the GPR models built with the full training data and the chosen data separately are tested on the same test dataset.
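The sliding-window construction described above (8 input days, 3 response days) can be sketched as follows; the function name is our own.

```python
import numpy as np

def make_windows(series, n_in=8, n_out=3):
    """Slide over the daily series: 8 consecutive days form each input
    vector, and the next 3 days (Y(1), Y(2), Y(3)) form the response."""
    X, Y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        Y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(Y)
```

Applied to the full daily series, this yields the 1272 data items described above (one item per admissible start day).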

Point Prediction Analysis
The prediction performance of the proposed model built with the full training data and with the chosen training data in predicting Y(1), Y(2), and Y(3) is evaluated. Meanwhile, the published forecasting model based on the ARMA algorithm [13] is considered as a benchmark. Four metrics, RMSE, MAE, MAPE, and computing time, are utilized to assess the forecasting performance, and the experimental results are shown in Tables 1-3.

It can be seen from Tables 1-3 that the GPR model with the training sample size of 100 obtained the best performance among the three models in terms of the lowest RMSE, MAE, and MAPE values. The GPR model trained using the 500 selected training samples yielded worse prediction results for Y(1) and Y(2) compared to the GPR model based on the full training dataset. However, the error difference between these two models is minor, and the GPR model trained using the 500 selected training samples performs better for Y(3) prediction.

The point prediction results in Tables 1-3 show that using the chosen data can improve the prediction accuracy while requiring much less computation time. Thanks to the RU design algorithm, the most representative subset is selected from the original training dataset. A proper setting of the selected sample size helps to improve the prediction performance of the GPR model and to reduce the computation time simultaneously. Therefore, the proposed method is applicable to GPR point prediction on large datasets. Compared with the ARMA-based model, the GPR model with the training sample size of 100 yields better forecasting performance in terms of lower RMSE, MAE, and MAPE values as well as shorter computing time.

To further illustrate the performance of the proposed method, the results of the GPR models built with different training samples for Y(1), Y(2), and Y(3) prediction are plotted in Figures 2-4. In point prediction, the 100 training samples were selected as the chosen data.
Compared with the prediction results shown in Figures 3 and 4, the prediction results in Figure 2 are closer to the actual soil moisture. This indicates that the prediction accuracy decreases as the forecasting horizon increases.
In Figures 2-4, the red dotted line denotes the actual values, the blue line shows the forecast of the GPR model trained on the chosen data of size 100, and the green dashed line indicates the forecast of the GPR model trained on the full training dataset. The blue curve is closer to the red curve, showing that the prediction accuracy of the GPR model trained on the chosen training samples is higher than that of the GPR model based on the full training dataset.

Interval Prediction Result Analysis
As shown in Tables 4-6, the GPR model built with a training dataset size of 500 generally performs the best for Y(1), Y(2), and Y(3) prediction. However, the GPR model based on the full training dataset generates better PICP and ACE values at the 90% confidence level in Y(2) and Y(3) prediction. The GPR model trained with 100 samples obtained the worst interval prediction performance among the three models. A plausible explanation is that the GPR model cannot learn the true distribution well with few training samples. To illustrate the interval prediction results, the PIs of the proposed method are provided in Figures 5-8. In interval prediction, the 500 training samples are selected as the chosen data.

Discussion
In terms of point prediction, it can be seen from Tables 1-3 that the GPR model with the training sample size of 100 obtained the best performance. Thus, the extracted subset includes the most representative training samples, and a well-performing forecasting model is developed. Meanwhile, because only 100 data samples are utilized, two benefits can be observed: (1) since noisy data are less likely to be sampled, the forecasting accuracy can be improved; (2) the computational time is significantly reduced due to the small size of the kernel matrix. The GPR model trained using the 500 selected training samples yielded worse prediction results for Y(1) and Y(2) compared to the GPR model based on the full training dataset, although it performed better for Y(3) forecasting. A possible reason is that some noise points are included in the 500 samples, degrading the model performance. Compared with the widely applied ARMA algorithm [16], the proposed model achieved better performance, and the execution time of the GPR model based on the 100 training samples is much less than that of the ARMA model. Therefore, the proposed model is more suitable for real-time forecasting applications.
For interval prediction, as shown in Tables 4-6, the GPR model built with a training dataset size of 500 generally performs the best for Y(1), Y(2), and Y(3) prediction, although the GPR model based on the full training dataset generates better PICP and ACE values at the 90% confidence level in Y(2) and Y(3) prediction. The GPR model trained with 100 samples obtained the worst interval prediction performance among the three models. This may occur because the GPR model cannot learn the true distribution well when the training sample is small; 500 sample points appear to be enough for the model to learn the sample distribution, so the entire dataset is not required. The interval prediction results verify the effectiveness of the RU design algorithm in selecting representative samples.

Seasonal Factor Analysis
To take the impact of the seasonal factor on the forecasting performance into account, the 30-day data of June (summer) and September (fall) were used as test datasets to compare the prediction performance under different training sample sizes. The test data of June and September are denoted as test June and test Sep, respectively.
From Tables 7 and 8, it is observed that the RMSE, MAE, and MAPE values of the GPR models on test June are all higher than those on the full test data, whereas the performance of the GPR models on test Sep is better than that on the full test data. The interval prediction results of the two selected months are drawn in Figures 8-11.

Figures 11-14 demonstrate that the seasonal factor has an impact on the prediction performance of the GPR models. In June, the weather changes dramatically and there is much rainfall; therefore, compared with September, prediction in June is more difficult. It is more challenging to forecast soil moisture in summer.

Conclusions
In this paper, an RU design algorithm was applied to sample representative data points to develop the GPR model in predicting the soil moisture. The following conclusions are drawn from this study.
(1) Through the RU design algorithm, the most representative subset (the chosen data) was sampled from the complete training dataset. Based on the chosen data with proper sample size setting, the performance of the GPR model was improved and much less computing time was required.
(2) For point prediction, the RMSE, MAE, and MAPE values of the GPR model built with the chosen data were smaller than those of the GPR model based on the full training data.
(3) Regarding the interval prediction, PIs of the GPR model built with the chosen data were better considering the reliability, sharpness and overall skill.
In future work, the integration of the RU design algorithm and other learning algorithms, such as deep neural networks [33] and boosted regression trees [34], will be investigated for the soil moisture prediction.

Conflicts of Interest:
The authors declare no conflict of interest.