Groundwater Prediction Using Machine-Learning Tools

: Predicting groundwater availability is important to water sustainability and drought mitigation. Machine-learning tools have the potential to improve groundwater prediction, thus enabling resource planners to: (1) anticipate water quality in unsampled areas or depth zones; (2) design targeted monitoring programs; (3) inform groundwater protection strategies; and (4) evaluate the sustainability of groundwater sources of drinking water. This paper proposes a machine-learning approach to groundwater prediction with the following characteristics: (i) the use of a regression-based approach to predict full groundwater images based on sequences of monthly groundwater maps; (ii) strategic automatic feature selection (both local and global features) using extreme gradient boosting; and (iii) the use of a multiplicity of machine-learning techniques (extreme gradient boosting, multivariate linear regression, random forests, multilayer perceptron and support vector regression). Of these techniques, support vector regression consistently performed best in terms of minimizing root mean square error and mean absolute error. Furthermore, including a global feature obtained from a Gaussian Mixture Model produced models with lower error than the best which could be obtained with local geographical features.


Introduction
In many countries, groundwater is one of the key natural resources, supplying a large portion of the water used by the nation. Besides its use in households and businesses, other major groundwater consumers include: (i) rural households and public water supplies that depend on wells and groundwater; (ii) farmers who use groundwater to irrigate crops and water their animals; and (iii) commercial businesses and industries that depend on groundwater for their processes and operations. The importance of groundwater is also reflected in its role in supplying springs, ponds, marshlands, swamps, streams, rivers and bays. However, despite its unequivocal importance, groundwater levels in aquifer systems are often not constant and depend on recharge from infiltration of precipitation.
Several major acts and declarations, such as the South African National Water Act [1] and the 4th World Water Forum [2], recognize water as a basic human need and a major contributor to social development, since it helps to alleviate poverty [1]. Hence, there is growing interest in the use of groundwater to help alleviate this crisis [2]. Groundwater is a vital freshwater resource, providing around 50% of the available drinking water according to UNESCO [3]. Sectors such as agriculture and industry also depend greatly on groundwater for their operations, owing to its widespread availability and the fact that it is not easily polluted [3,4]. In 2015, the United Nations reaffirmed its commitment to the human right to safe drinking water and sanitation by identifying it as one of the 17 Sustainable Development Goals to be pursued by 2030 [5].
Predicting groundwater availability is important to water sustainability and drought mitigation. It can provide useful insights, based on real data, into what happened when the flow of streams and rivers declined and/or when water supply issues developed into a drought. Machine-learning technologies have the potential to drive groundwater knowledge discovery and management by assisting in the prediction of groundwater availability. This can be done by enabling the collection of massive water datasets, storing these datasets in databases, and processing them to obtain useful insights which can be used by water resource managers to: (1) anticipate water quality in unsampled areas or depth zones; (2) design targeted monitoring programs; (3) inform groundwater protection strategies; and (4) evaluate the sustainability of groundwater sources of drinking water.
This paper uses a regression-based approach to predict full groundwater images based on sequences of monthly groundwater maps of the southern part of the African continent using the Gravity Recovery and Climate Experiment (GRACE) dataset [6]. Five machine-learning techniques are implemented on the GRACE dataset and compared to predict pixels in future frames of the dataset. These are extreme gradient boosting (XGB), multivariate linear regression (MLR), random forest (RF), multilayer perceptron (MLP) and support vector regression (SVR). The prediction is guided by: (i) performing feature selection based on the XGB feature importance scores of the previous lags (pixels); and (ii) investigating the effect of adding other features, such as temporal features, position indices, and global features obtained from Gaussian mixture models (GMMs) fitted to peak areas of each image. This paper is organized as follows: Section 2 provides background on water prediction, citing relevant literature in the field; Section 3 describes the algorithms used in this work; Section 4 discusses the methodology used for groundwater prediction; Section 5 provides and discusses the results obtained; and Section 6 furnishes the conclusions.

Background on Groundwater Prediction
With the increase in population size coupled with urban expansion, water demand has dramatically increased, which has led to the over-exploitation of groundwater in many countries around the world [7,8]. This highlights the importance of groundwater forecasting. Accurate prediction of groundwater can help government officials determine the best approach to manage groundwater effectively [9]. The main tools for groundwater prediction are based on physical models and data-driven models [10]. Physical models require a large amount of detailed hydrological data, which suffers from a lack of accuracy during its collection and pre-processing [9]. Therefore, data-driven models tend to be more appealing, since they require less data and are more reliable [3,11].
Statistical models like multivariate linear regression (MLR), and various time series models such as autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA), have been used to investigate patterns between the input and the output of groundwater data to make future predictions. The following studies have investigated: MLR for groundwater prediction [12][13][14]; and time series models for groundwater prediction [12,[15][16][17]. Both techniques are considered to be linear fitting models [11]. Time series models have the advantage of accounting for the correlations that arise between data points [18]. In general, however, linear fitting is not ideal in describing the nonlinear behavior of groundwater. Hence, recent research has made use of MLR models more for comparative purposes [11].
In addition to these techniques, a range of machine-learning techniques have been applied to the problem, including MLP in [12,[19][20][21][22][23], SVR in [19,24] and recently RFs in [25,26]. The use of XGB is rare in the scheme of groundwater prediction, and is found in only a few studies such as [27,28].
The above studies can be broadly divided into those that predict the groundwater level (GWL) and those that estimate the terrestrial water storage (∆TWS). GWL provides an idea of the groundwater level, whereas ∆TWS provides an idea about the volume of the groundwater. The GRACE database gives geographical ∆TWS levels monthly [6]. The significance of GRACE in hydrology is that it can provide an understanding of groundwater storage conditions and trends [29]. GRACE has been used as a predictor to help in the estimation of the GWL in [29,30].
In the literature, there are two main approaches to the problem of sequential image prediction. The first involves taking a sequence of images as input to predict future frames using deep learning techniques such as Convolutional Long Short-Term Memory (ConvLSTM). Usually, the images used for this type of prediction are separated by relatively short time intervals, e.g., 6-10 min [31][32][33][34][35][36][37]. This approach normally does not involve any feature selection step, since deep-learning-related techniques are known for their feature selection and reduction properties. However, there are several concerns when using deep learning models: they depend heavily on a large quantity of high-quality data to produce an effective model; they are very costly to train and use, in terms of time and resources; and they are often viewed as black boxes [38], which means that it is very difficult to unpack and understand the automated feature selection process that takes place and the predictions that arise from any given deep-learning-based model. This leads us to the second approach, in which machine-learning techniques are used for single-output regression problems. For GRACE ∆TWS image reconstruction, the authors in [27] used both XGB and RFs to obtain the importance of 20 features. It was shown that the precipitation of the two months prior to prediction is the most important variable for estimating the TWS dynamics. In [28], the authors manually selected 11 hydrological predictors, including the total precipitation and snow cover, to predict ∆TWS. The idea of using previous pixels to predict current pixels, however, has not been sufficiently investigated in the literature. The authors in [39] compared Support Vector Machines (SVMs) and RF in predicting present grid-based rainfall up to 3 h ahead, where the input involved the antecedent radar-derived rainfall.
The authors in [40] used ANNs to predict full water vapor images every 30 min, incorporating information from the two previous images.

Techniques Used
In the following section we describe the tools and technologies that were used during the study. A total of five machine-learning techniques were used for image prediction: extreme gradient boosting (XGB), multivariate linear regression (MLR), random forest (RF), multilayer perceptron (MLP) and support vector regression (SVR). Aside from the task of prediction, XGB was also used as a feature extraction and selection tool. For feature engineering, we used Gaussian mixture models (GMMs) to capture global features (means and variances) of past images. For evaluation of the trained models, we used the RMSE and the MAE as metrics. All of the above-mentioned tools were implemented using the scikit-learn library [41] in the Anaconda Python distribution (version 2020.07) with their default hyper-parameter settings.
Sections 3.1-3.6 describe each of the five machine-learning techniques listed above. Section 3.7 describes the metrics used to evaluate the trained models.

Multivariate Linear Regression
Regression models estimate the level of correlation between the predictors and the output variables to determine the form of their relationship [42]. In linear regression, the mean square error is used both to fit the model and to assess the performance of the trained model on the testing set [42,43]. In general, MLR seeks the hyperplane that best fits all individual data points [42]. For simplicity, MLR will be abbreviated as LR in the following sections.
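As an illustration, the following sketch fits an MLR model with scikit-learn on synthetic data; the three features and the coefficients are invented for the example, not taken from the groundwater dataset.

```python
# Hedged sketch: multivariate linear regression on synthetic data.
# The feature matrix and target coefficients are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((200, 3))                  # e.g., three same-pixel lag features
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.01, 200)

lr = LinearRegression().fit(X, y)         # least-squares hyperplane fit
pred = lr.predict(X)
```

With near-noiseless data, the recovered coefficients closely match those used to generate the target.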

Multilayer Perceptron
MLPs are a type of artificial neural network, a class of models inspired by the biological nervous system of the human brain. They can emulate complex functions such as decision making and pattern generation [44]. Like the human brain, MLPs consist of a set of processing units called 'neurons', which are connected to each other. Each neuron is a multi-input, single-output nonlinear element [45]. Neurons mostly operate in parallel and are arranged in multiple layers: an input layer into which the data are fed; hidden layers, where the learning takes place; and an output layer [44]. MLPs can detect complex nonlinear relationships through a learning process that involves adjusting the weighted connections that exist between the neurons. This gives MLPs the ability to perform two important functions: pattern classification and nonlinear adaptive filtering [46].
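A minimal sketch of such a network in scikit-learn is shown below; the hidden-layer size and the synthetic nonlinear target are illustrative choices, not the paper's configuration.

```python
# Hedged sketch: an MLP regressor learning a nonlinear mapping.
# Hidden-layer size and synthetic target are illustrative only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2    # nonlinear target

mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
mlp.fit(X, y)                             # training adjusts connection weights
```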

Random Forest
RF uses an ensemble of classification and regression trees, where each tree is built using a different bootstrap sample (with replacement) from the original data [47]. Compared to traditional trees, RF adds a layer of randomness to bagging: in traditional trees, each node is split using the best split among all variables [48], whereas in RF only a random subset of the variables is considered when splitting a node during the construction of a tree [47,48]. As a result of this random construction, RF is more robust to overfitting than some other techniques [48,49].
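The two sources of randomness just described map directly onto scikit-learn parameters, as in the sketch below (the data are synthetic and the parameter values are illustrative).

```python
# Hedged sketch: random forest regression with bootstrapped trees and a
# random subset of variables considered at each split. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 4))
y = X[:, 0] + 2.0 * X[:, 1]

rf = RandomForestRegressor(
    n_estimators=100,       # number of bootstrap-sampled trees
    max_features="sqrt",    # random variable subset at each split
    random_state=0,
).fit(X, y)
```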

eXtreme Gradient Boosting
Like RF, XGB is an ensemble learning technique. XGB relies on gradient boosting to form a combined prediction. In XGB, the predictors in a tree are built in a sequential manner, and are trained on the residuals of the previous learners, so that errors are reduced step by step [27].
In the Python implementation of XGB, the plot_importance function can be used to determine the importance of the features in a trained predictive model [50]. It computes, for each feature, the sum of the estimated improvements in squared-error risk over all decision nodes that split on that feature, averaged over all trees in the model. The averaging greatly reduces the masking effect that occurs when variables are correlated [51].
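The idea can be sketched with scikit-learn's GradientBoostingRegressor as a stand-in for XGB (the dominant feature below is constructed for the example); on a trained XGB model, xgboost's plot_importance exposes analogous per-feature scores.

```python
# Hedged sketch: per-feature importances from a boosted-tree model, using
# GradientBoostingRegressor as a stand-in for XGB. Synthetic data in which
# feature 0 is built to dominate the target.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((400, 5))
y = 3.0 * X[:, 0] + 0.1 * X[:, 4]

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
importances = gbr.feature_importances_    # normalized importance per feature
```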

Support Vector Machine and Support Vector Regression
SVM is a powerful machine-learning technique that has the capability to perform structural risk minimization (SRM), which enables it to avoid overfitting by minimizing the bound on the generalization error [52]. SVMs may be extended to apply to estimation and regression problems: this extension is known as support vector regression (SVR) [53]. SVR maps the input data into a higher-dimensional feature space via nonlinear kernel functions [54]. The objective is to choose a vector of regression coefficients with a small norm, while minimizing the sum of the distances between the data points and the regression hyperplane in the higher-dimensional space [55].
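A minimal SVR sketch with an RBF kernel is shown below; the 1-D synthetic target and the C/epsilon values are illustrative, not the paper's settings.

```python
# Hedged sketch: support vector regression with an RBF kernel, which maps
# the input into a higher-dimensional feature space. Synthetic data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0])                   # nonlinear 1-D target

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
```

The epsilon parameter controls the width of the insensitive tube around the regression function, and C trades the norm of the coefficients against deviations outside that tube.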

Gaussian Mixture Models
Gaussian mixture models (GMMs) may be used for clustering [56] or as parametric models of the probability distribution of continuous features [57]. The user specifies the number of Gaussians in the model, and the means and covariances of the Gaussians are automatically computed using an expectation maximization (EM) algorithm [58].
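A short sketch of an EM-based GMM fit on synthetic 2-D points follows; the blob locations and component count are invented for illustration.

```python
# Hedged sketch: fitting a two-component GMM by EM to synthetic 2-D points.
# The number of components is user-specified; means and covariances are learned.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal([0.0, 0.0], 0.3, (200, 2)),    # first blob
    rng.normal([4.0, 4.0], 0.3, (200, 2)),    # second blob
])
gmm = GaussianMixture(n_components=2, random_state=0).fit(pts)   # EM fit
means = gmm.means_
```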

Performance Metrics
The accuracies of the above machine-learning models are evaluated using the root mean square error (RMSE) and the mean absolute error (MAE). Both metrics are commonly used to measure forecasting accuracy [59]. RMSE is more sensitive to outliers and is more appropriate for Gaussian-distributed errors, while MAE weights all errors equally [60]. The RMSE and MAE are computed as follows:

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pre}}\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i^{\mathrm{obs}} - y_i^{\mathrm{pre}}\right|$

where $y_i^{\mathrm{obs}}$ and $y_i^{\mathrm{pre}}$ refer to the observed and predicted value of the ith output, respectively, and n is the number of outputs.
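These two formulas translate directly into code:

```python
# Direct implementations of the RMSE and MAE formulas defined above.
import numpy as np

def rmse(y_obs, y_pre):
    """Root mean square error."""
    d = np.asarray(y_obs) - np.asarray(y_pre)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_obs, y_pre):
    """Mean absolute error."""
    d = np.asarray(y_obs) - np.asarray(y_pre)
    return float(np.mean(np.abs(d)))
```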

Groundwater Prediction Methodology
Following the flowchart in Figure 1, we start by discussing the data set, followed by the pre-processing of the images and the preparation of the data set. We then describe the feature selection performed using XGB and the feature engineering using GMMs. Finally, we describe the experiments. Our end goal is to predict groundwater at the pixel level, producing a full image from a sequence of input images.

Monthly Groundwater Data Set
The dataset used in this study consists of 174 monthly groundwater satellite images spanning March 2002 to May 2019. Each image has a size of 360 × 180 pixels and provides a color-coded representation of the ∆TWS of the earth's land surface. A sample image from the dataset is provided in Figure 2. The images were originally obtained as part of the GRACE survey conducted by the U.S. National Aeronautics and Space Administration (NASA). The actual data was obtained from the Physical Oceanography Distributed Active Archive Center (PO.DAAC) website [6]. Using this dataset posed some serious challenges. Several monthly images in the dataset were missing. Neglecting these months would disrupt the periodicity/seasonality in the data, which is a key aspect of the data. Further reducing the dataset was not feasible, since the number of images is already on the small side for machine-learning applications. Finding better methods for dealing with missing images is an ongoing research topic.

Image Pre-Processing
Predicting a full color image at the pixel level would be computationally expensive, since the full image consists of R, G, B values for 360 × 180 × 3 pixels. Hence, to reduce the computational cost, we transformed the images to grayscale and cropped them to focus on the southern area of the African continent, with a size of 47 × 51 pixels. An image of the pre-processed data is shown in Figure 3 (left). The dataset provided by PO.DAAC had missing months. We imputed the data for the missing months by replicating the previous months' images. Of the 174 frames, we deleted the first two because of a gap of about 100 days between the 2nd and the 3rd image in the data. This left 172 images, and after image imputation we ended up with a dataset of 190 images. Altogether 18 images were imputed, which amounts to about 10% of the original 172 images. We then applied a sliding window to form 161 sequences of 12 consecutive images. The first 149 sequences were used for training, and the rest for testing. We did not use any of the imputed images as labels for the models to train on.
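The sliding-window construction can be sketched as follows; the helper name and the toy frame stack are our own, and the real pipeline additionally excludes windows whose label is an imputed image.

```python
# Hedged sketch of the sliding-window split: each sample is 12 consecutive
# monthly frames, and the label is the following (13th) frame. The toy
# frame stack below stands in for the 190-image GRACE dataset.
import numpy as np

def make_sequences(frames, window=12):
    X, y = [], []
    for t in range(len(frames) - window):
        X.append(frames[t:t + window])     # 12-month input window
        y.append(frames[t + window])       # 13th month as the label
    return np.array(X), np.array(y)

frames = np.random.default_rng(0).random((20, 47, 51))   # toy stack
X, y = make_sequences(frames)
```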

Feature Selection
Since the dataset in this research is small, it was particularly important to choose a set of features of limited size (to avoid overfitting) but which still captures essential information that can be used for prediction. In our model we used both local and global spatiotemporal features, and additionally performed a rescaling, as described in the following subsections.

Same-Pixel Features
Our first set of candidate features for predicting an image pixel consists of the values at the same pixel location in the 12 previous months. A similar choice has been made in previous studies [39,40,61]. To select the most important of these 12 features, a sliding window technique is used, as shown in Figure 3 (right): the same pixels within a prior 12-month window were used to predict the corresponding pixel in the 13th month.
To evaluate the relative importance of these features, XGB with the gain metric was applied to the training set. Figure 4 shows the results. In the figure, f(0) stands for the same month in the previous year, while f(11) stands for the previous month. The graph shows that f(0), f(11), and f(1) (12, 1, and 11 months previous, respectively) have the greatest importance. This finding agrees with [62,63], which also used the previous and 12-month-prior pixels to predict corresponding points in the current month. Based on our results, we created seven different feature sets labelled a-g as follows: • a = f(0, 11)
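A selected lag set such as a = f(0, 11) might be extracted from a 12-month window as sketched below; the array shapes and helper name are our own, not the paper's code.

```python
# Hedged sketch: build same-pixel lag features from a (12, H, W) window.
# Lag index 0 is the same month one year earlier; lag 11 is the previous
# month. Shapes and helper name are illustrative.
import numpy as np

def lag_features(window_stack, lags):
    sel = window_stack[list(lags)]         # (len(lags), H, W)
    return sel.reshape(len(lags), -1).T    # one row per pixel

stack = np.random.default_rng(0).random((12, 47, 51))
Xa = lag_features(stack, (0, 11))          # feature set a = f(0, 11)
```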

Other Local Spatiotemporal Features, and Rescaling
Because of the geographical and seasonal nature of groundwater levels, the following spatiotemporal features were also deemed significant and were used: the pixel position indices (i, j) and the time stamp t. Since most of the pixel values are low and high values were relatively infrequent, the pixel values were replaced by their square roots to regularize the scale. The square root transformation was similarly applied to inputs in [64], and to outputs in [65,66].
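These transformations might be combined per frame as follows; the helper and its signature are illustrative, not the paper's implementation.

```python
# Hedged sketch: augment each pixel with its (i, j) position and the month
# stamp t, after square-root rescaling of the intensity. Illustrative only.
import numpy as np

def add_spatiotemporal(pixels, H, W, t):
    ii, jj = np.divmod(np.arange(H * W), W)   # row/column indices
    s = np.sqrt(pixels)                       # square-root rescaling
    return np.column_stack([s, ii, jj, np.full(H * W, t)])

feats = add_spatiotemporal(np.arange(6, dtype=float), H=2, W=3, t=5)
```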

Global Feature Generation Using Gaussian Mixture Models
In this subsection we describe how we used Gaussian mixture models (GMMs) to generate global features. This idea came from observing that regions of high groundwater level seemed to form shapes that could be described by Gaussian distributions, which propagated from image to image. The means and covariances of these Gaussian-shaped features can be used as global features that describe the motion of high regions from image to image. To apply GMM to an image, we converted the image to a set of pixel locations by randomly selecting pixels with selection probability proportional to the pixel's scaled intensity (see Figure 5). These pixel locations were fed to the GMM algorithm, which returned the means, covariance matrices, and weights of the Gaussian clusters. In this study, we set up the algorithm to have only one cluster. The pixel located at the cluster mean and the two eigenvalues of the covariance matrix were used as global features. We applied GMM to image 10 (two months previous) and image 11 (one month previous), yielding a total of 8 global features. As described above, our application of GMM involved randomization when choosing pixel locations. To account for this randomness, when evaluating models that used GMM features we created 100 different models using different randomizations. From those 100 results we took the per-pixel averages to get a single model, and computed the RMSE and MAE of this model to obtain accuracy estimates.
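The procedure may be sketched as below; the sample size, seed handling, and toy image are our assumptions, while the paper applies this to two prior GRACE frames and averages results over 100 randomizations.

```python
# Hedged sketch: intensity-weighted pixel sampling followed by a
# single-component GMM fit; returns the cluster mean (2 values) and the
# two covariance eigenvalues, i.e., 4 global features per image.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_global_features(image, n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    H, W = image.shape
    p = image.ravel() / image.sum()           # selection prob. proportional to intensity
    idx = rng.choice(H * W, size=n_samples, p=p)
    pts = np.column_stack(np.divmod(idx, W)).astype(float)  # (row, col)
    gmm = GaussianMixture(n_components=1, random_state=seed).fit(pts)
    eigvals = np.linalg.eigvalsh(gmm.covariances_[0])
    return np.concatenate([gmm.means_[0], eigvals])

img = np.zeros((47, 51))
img[20:25, 30:35] = 1.0                       # toy bright region
feats = gmm_global_features(img)
```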

Performance Results
Tables 1-4 show performance accuracies for models trained using different features. Table 1 uses only same-pixel data from previous months; Table 2 adds the features (i, j), which give the pixel's position in the 2D array; Table 3 adds the time stamp (denoted by t); and Table 4 additionally applies the square root transformation (denoted by s) to the pixel values. The code, together with the image data, is available on GitHub at https://github.com/EslamHussein55/Groundwater-Pixel-Prediction-using-ML-tools.

Performance Comparisons
The data in Tables 1-4 are summarized in Figures 6 and 7. For MAE, XGB with all features (including same-pixel, spatiotemporal, and square root rescaling) gives the overall best performance. However, SVR is clearly the best performer for most feature sets. SVR tends to work best with the fewest same-pixel features (i.e., configuration a, which is the previous month + 12 months prior). SVR with configuration a + (i, j) reduces the MAE by about 7% over the best result without spatial features. Adding the time stamp and square root rescaling gives little additional improvement. In general, SVR gave MAE values that were 7-20% better than the worst-performing algorithms, which were random forest (for the same-pixel and same-pixel + spatial location feature sets) or linear regression (for the other feature sets). It is noteworthy that adding a time stamp brought large error reductions to random forest and XGB, while having little effect on linear regression, MLP, and SVR.
The RMSE results resemble the MAE results in that SVR consistently gives the lowest error. This time, however, XGB with a + i, j + t + s does not outperform the SVR results. Once again, same-pixel configuration a tends to give the best results for SVR, and a + i, j has about 4% lower error than a alone. In general, SVR gave RMSE values that were 2.5-15.5% better than the worst-performing algorithms.

When square root rescaling is used, all predictors reduced MAE and RMSE by over 15% and 20%, respectively, relative to the untrained predictor. The overall best predictor (a + i, j + t) reduced both MAE and RMSE by more than 20%. This result is consistent with [39], which found that SVM outperformed RF when predicting rainfall images up to 3 h ahead on a per-pixel basis. When GMM was added to XGB with a + i, j + t + s, the values of MAE and RMSE obtained were 2.258 and 4.838, respectively. These values were better than the corresponding best results without GMM by 3.6% and 9.4%, respectively. Compared to the untrained predictor, this XGB+GMM model gave a 25% improvement in MAE and a 29% improvement in RMSE. Figure 9 shows an example of an actual image and its prediction using XGB+GMM.

All of the above results were obtained using the default parameters in scikit-learn for the respective methods. Parameters were not optimized because of the large number of different methods involved. In particular, optimizing GMM is very expensive, since it uses 100 trained models which would all have to be optimized separately. We did conduct a preliminary investigation into parameter optimization by tuning the parameters of RF and XGB for the models in Tables 1-3. For this purpose, the scikit-optimize package was used, which employs Bayesian optimization. Improvements in MAE and RMSE were less than 6.5%, and still fell short of the performance obtained with GMMs without parameter optimization.
Figure 10 gives residual plots and R² values for the best XGB+GMM model (a + i, j + t + s) and the best SVR model (a + i, j + t), superimposed on the untrained model residuals. Visually, XGB+GMM and SVR give predictions closer to the 45° line than the untrained regressor, while the R² values are more than doubled. Figure 11 presents Regression Error Characteristic (REC) curves for the same three models. For XGB+GMM, 85% of pixel predictions have a deviation of 5 or less.
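A single point on an REC curve is simply the fraction of predictions within a given deviation tolerance, as in the sketch below (the function name and numbers are illustrative).

```python
# Hedged sketch: one point of a Regression Error Characteristic (REC)
# curve, i.e., the fraction of predictions within a deviation tolerance.
import numpy as np

def rec_accuracy(y_obs, y_pre, tol):
    dev = np.abs(np.asarray(y_obs) - np.asarray(y_pre))
    return float(np.mean(dev <= tol))

acc = rec_accuracy([10, 20, 30, 40], [12, 27, 31, 40], tol=5)   # 3 of 4 within 5
```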

Conclusions
This paper investigated the automatic prediction of groundwater ∆TWS in the GRACE dataset. The proposed method uses a regression-based approach to predict full groundwater images based on sequences of monthly groundwater maps.
Our results show that the application of appropriate machine-learning techniques can yield significantly more accurate predictions. In particular, it was shown that the use of SVR as a predictor, automatically selected same-pixel values and the time stamp as features, and square root rescaling all contributed to better overall prediction outcomes. Global features constructed from GMMs fitted to the pixel intensity distribution brought further improvements.
In future work we will apply these methods to other regions and meteorological parameters such as rainfall, temperature, air pressure, humidity etc. We shall also explore possible improvements to the method such as better imputation of missing values and the investigation of other global features. Additionally, we shall extend these methods to the joint prediction of multiple parameters.