Long-Term Groundwater Level Prediction Model Based on Hybrid KNN-RF Technique

Reliable seasonal prediction of groundwater levels is not always possible when the quality and the amount of available on-site groundwater data are limited. In the present work, a hybrid K-Nearest Neighbor-Random Forest (KNN-RF) is used for the prediction of variations in groundwater levels (L) of an aquifer with the groundwater relatively close to the surface (<10 m) is proposed. First, the time-series smoothing methods are applied to improve the quality of groundwater data. Then, the ensemble K-Nearest Neighbor-Random Forest (KNN-RF) model is treated using hydro-climatic data for the prediction of variations in the levels of the groundwater tables up to three months ahead. Climatic and groundwater data collected from eastern Rwanda were used for validation of the model on a rolling window basis. Potential predictors were: the observed daily mean temperature (T), precipitation (P), and daily maximum solar radiation (S). Previous day’s precipitation P (t − 1), solar radiation S (t), temperature T (t), and groundwater level L (t) showed the highest variation in the fluctuations of the groundwater tables. The KNN-RF model presents its results in an intelligible manner. Experimental results have confirmed the high performance of the proposed model in terms of root mean square error (RMSE), mean absolute error (MAE), Nash–Sutcliffe (NSE), and coefficient of determination (R2).


Introduction
Groundwater is the most critical source of fresh water that serves about one-third of the world's water demands. Socio-economic development is closely linked with the availability and accessibility of groundwater resources [1]. For instance, 36% of the domestic freshwater supply, 42% of water for agriculture, and 27% of the industrial water demand come from groundwater [2][3][4][5]. While the world's water demand is expected to rise significantly in the future [6], recent studies report an intensive drop in groundwater levels in many parts of the world [7][8][9][10]. Human behavior and the impact of climate change are considered to be the root cause of this [11][12][13]. Nevertheless, the levels of the water tables may also fluctuate seasonally due to the amount of evapotranspiration extracts, hydraulic properties, and other natural events [14][15][16]. Also, diminished precipitation and high temperature can also lead to reduced groundwater levels during dry periods [15]. The increased dependence on groundwater, spatial-temporal variation, and discrepancies of groundwater resources have also impacted ground water levels [6,12,17,18]. This is more evident in sub-Saharan Africa (SSA) where the hydroclimate variability and droughts pose a real challenge to scientists [19]. The intense droughts are having a long-lasting economic impact on the livelihoods of people in sub-Saharan Africa [20].
Despite the high potential of groundwater for socio-economic development in sub-Saharan Africa, there has been little attention paid to this precious resource [21][22][23]. There has been limited data and groundwater research done in sub-Saharan Africa [21], and it is, therefore, very important to increase research on SSA groundwater [21,23]. Rwanda, an East African country, has been experiencing seasonal groundwater scarcity [24][25][26][27], with the eastern province the most susceptible to drought due to their reliance on groundwater as the main source of fresh water [26,27]. It is also projected that eastern Rwanda will probably face prolonged droughts in the future [26]. Little is known about the situation of groundwater resources in Rwanda [24,26,27]. Sensible decision making for water management requires timely, reliable, and actionable information. Improving methods for a precise seasonal forecast of the changing groundwater levels is a strategy towards improving the management of groundwater resources [6,15,[28][29][30].
Advancements in computer modeling, computing power, and information processing have resulted in improved and practical tools for better understanding highly complex natural systems. A large amount of work has been focused on the applicability of machine learning methods to the science of hydrology. Machine learning methods have been proven in their relevance to groundwater studies [31]; however, no single technique is utilized as the available data and scenario will determine the most suitable method for a problem at hand [32]. Many of the methods described in the literature are not suited for sparse and noisy samples [33,34]. Therefore, this study intends to examine the capacity of the KNN-RF ensemble model for the characterization of seasonal responses of the groundwater levels of a permeable fractured aquifer in eastern Rwanda utilizing limited site-data. The remainder of this article is organized as follows: Section 2 elucidates related research, and Section 3 explains the study area and data preparation. The research methodology is outlined in Section 4, while research results and discussion and conclusions are provided in Sections 5 and 6, respectively.

Related Work
Machine learning (ML) based approaches to hydrological modeling are an important area of research. These models are more suited to water resources studies than physical models [30,31]. The simplicity and accuracy of ML is by far better than other models [31,[35][36][37]. The most common methods for prediction of groundwater resources are the artificial neural network (ANN) [35,36,[38][39][40][41][42], the support vector machine (SVM) [43][44][45][46][47], and feature similarity-based approaches such as K-nearest neighbor (KNN) [48]. Nearest neighbor can also be applied with noisy samples [49]. A broad analysis of the application of ANN to the modeling of water resources is presented in the report shared by the ASCE Task force committee [37]. Zhou et al. [50] report a comparative study on ANN and SVM for the modeling of water table depths. The scholars employed discrete wavelets in data preparation, and their results suggest that SVM has higher accuracy scores than the ANN model. Similarly, the SVM method is reported to have a relatively higher score than the adaptive neuro-fuzzy inference system and the ANN [51,52]. Additionally, Natarajan et al. [53] assessed the accuracy of SVM, extreme learning (ELM), genetical programming (GP), and ANN in the simulation of groundwater levels. Their finding suggests that ELM achieves higher precision than GP, SVM, and ANN. A thorough elucidation of the usage of SVM in hydrology can be found in [54][55][56]. The most recent research examines the efficiency of a non-linear auto-regressive network with exogenous input (NARX) in modeling variations of the heights of the water table using precipitation and temperature data [57][58][59]. Wunsch et al. [57] and Guzman et al. [58] reported a high performance of NARX in seasonal predictions of groundwater depths for different types of aquifers, whilst Guzman et al. [59] recommend SVM over NARX for the forecast of daily levels. Random forest (RF) is another ensemble method for the efficient and effective representation of variations in groundwater levels. RF can efficiently handle both small as well as big datasets [60][61][62]. It is one of the most effective data-driven supervised simulation methods that does not overfit the datasets for the modeling of hydrology systems [63][64][65][66]. RF has a small number of parameters to tune; this decreases the pre-processing effort and results in faster computation, which is potent in the water resources application [65,67]. Wang et al. [60] performed short-term forecasting of groundwater levels in data-scarce areas using enhanced random forest. The scholars infused random features with RF to promote the forecasting skill of the model based on temperature and precipitation data. An enlightening discussion of the application of random forest in hydrology is found in a study by Tyralis et al. [66]. In an effort to boost the accuracy of ML models, some studies have attempted to combine multiple techniques. For example, a hybrid method combining ANN, SVM, and adaptive fuzzy inference for the simulation of groundwater levels is discussed in [51,52]. Combining multiple techniques can also boost the predictive competence of ML methods in the presence of limited input data [68,69]. A review of the literature indicates that little research has been done on the seasonal prediction of fluctuations of the groundwater tables. Most studies on the prediction of groundwater levels have focused their attention on short-term predictions. It is more important to deal with acute groundwater periods rather than focusing on annual averages [15]. A small number of studies that have examined the seasonal forecast of groundwater table variations are not designed nor suitable for the SSA environment. Many of the approaches discussed in the literature require extensive and noise-free data for reliable predictions [33]. More effective data-driven forecasting models are crucial for the quantification of groundwater resources [51].
In the study, we employ ensemble KNN-RF with time series preprocessing to predict seasonal variations of the levels of groundwater tables in a data-scarce environment. RF and KNN are non-complex, powerful methods that work efficiently in regression problems when trained using a sufficient number of training examples [65,67,70]. The nearest neighbor technique competently identifies decisive areas in the input data, but with limited training examples, the model may underperform on unseen data. While the greediness nature of RF might cause sub-optimal results for some types of data (i.e., sparse data) [70], randomized trees enhance generalization and overcome the over-fitting issue in the small datasets [71,72]. Provided with limited and noisy examples, we harness the merits of both techniques by combining them in a hybrid manner. By combining approaches the research: (1) Devises and examines the performance of the KNN-RF ensemble scheme under a limited data setting, (2) Characterizes the seasonal response of the permeable fractured aquifer in a temperate region with limited groundwater studies, and (3) Contrasts the proposed KNN-RF model with conventional groundwater modeling techniques (SVM, RF, KNN, and ANN). This study will offer the first ML-based seasonal approximation of groundwater level in Rwanda. The study will also provide new insights about the capacity of the novel KNN-RF approach for sub-Saharan semi-arid conditions.

Case Study and Data Processing
This section discusses the characteristics of the area under investigation, the sources, the nature of the research data, the preparation of the data, and the evaluation metrics used for the current exploration.

Study Area and Data
The investigated well is found in eastern Rwanda, which lies between 29.86875E-29.90625E and 2.30625S-2.26875S with a total area of 9813 km 2 (3789 sq mi). This region is relatively flat with the altitude ranging between 1000 and 1500 m [73]. During the study period between December 2016 and December 2018, the majority of the rain showered in the wet season between March and May (90%), with rainfall ranges between 450 mm and 500 mm. This is less than when compared to other parts of the country, that receive on average between 600 mm and 800 mm annually [74]. The eastern province is characterized by the highest evapotranspiration rate in Rwanda. The average annual temperature varies between 15.70 • C and 24.20 • C. The average minimum temperatures (13 • C-16.65 • C) are recorded in May and June, while the maximum average temperatures (24.3 • C-30.3 • C) are recorded in July and September [74]. The eastern province is the most populated area in Rwanda [75], and is heavily reliant on groundwater as a source of fresh water. The groundwater abstraction rate is 742 m 3 /h, while the demand is estimated to be between 3069 and 7672 m 3 /h [76]. This area has highly heterogeneous types of aquifers. Those aquifers range from low permeable fractured (schist), which is located in Rugarama; permeable fractured (quartzite) located in Mukarange; and fractured (granite), which is found in Ruhuha. The Rwanda Water and Forestry Authority (RWFA) has groundwater stations in each of those three aforementioned areas. A summary of the monitoring well and its features are shown in Table 1. Groundwater data for 2016, 2017, and 2018 were gathered from the Rwanda Water and Forestry Authority (RWFA). Weather records (precipitation, solar radiation, and temperature) are available for a longer period, and researchers decided to use only data matching to the observational period of the groundwater level. Weather data (precipitation, solar radiation, and temperature) for the period of two years, from 3 December 2016, to 30 December 2018, were obtained from nearby weather stations (Kawangire, Kibungo, and Nyagatare) that are operated by the Rwanda meteorological agency (Meteo-Rwanda). Temperature is a daily minimum and maximum observed metric measured in Celsius, daily precipitation is recorded in millimeters, while groundwater level is measured in centimeters obtained from two measurements of groundwater depth per day. For consistency reasons, the groundwater unit was converted into meters. Solar radiation is measured in watt per meter square (W/m) and is included in the study because it influences the evaporation and evapotranspiration [77]. The locations of the groundwater and weather stations in the case study are depicted in Figure 1.  [74]. The eastern province is the most populated area in Rwanda [75], and is heavily reliant on groundwater as a source of fresh water. The groundwater abstraction rate is 742 m /h, while the demand is estimated to be between 3069 and 7672 m /h [76]. This area has highly heterogeneous types of aquifers. Those aquifers range from low permeable fractured (schist), which is located in Rugarama; permeable fractured (quartzite) located in Mukarange; and fractured (granite), which is found in Ruhuha. The Rwanda Water and Forestry Authority (RWFA) has groundwater stations in each of those three aforementioned areas. A summary of the monitoring well and its features are shown in Table 1. Groundwater data for 2016, 2017, and 2018 were gathered from the Rwanda Water and Forestry Authority (RWFA). Weather records (precipitation, solar radiation, and temperature) are available for a longer period, and researchers decided to use only data matching to the observational period of the groundwater level. Weather data (precipitation, solar radiation, and temperature) for the period of two years, from 3 December 2016, to 30 December 2018, were obtained from nearby weather stations (Kawangire, Kibungo, and Nyagatare) that are operated by the Rwanda meteorological agency (Meteo-Rwanda). Temperature is a daily minimum and maximum observed metric measured in Celsius, daily precipitation is recorded in millimeters, while groundwater level is measured in centimeters obtained from two measurements of groundwater depth per day. For consistency reasons, the groundwater unit was converted into meters. Solar radiation is measured in watt per meter square ( / ) and is included in the study because it influences the evaporation and evapotranspiration [77]. The locations of the groundwater and weather stations in the case study are depicted in Figure 1.  The eastern province has the highest number of boreholes and shallow wells in Rwanda. Generally, the eastern part has high-localized fractured aquifers with moderate groundwater yields, as represented in Figure 2. Alluvium based aquifers are mostly connected to fast-flowing rivers. These aquifers exhibit high groundwater potential in the eastern province [76]. The studied aquifer (Mukarange) is made of quartzite rocks fused on a schist base with a relatively high yield. The eastern province has the highest number of boreholes and shallow wells in Rwanda. Generally, the eastern part has high-localized fractured aquifers with moderate groundwater yields, as represented in Figure 2. Alluvium based aquifers are mostly connected to fast-flowing rivers. These aquifers exhibit high groundwater potential in the eastern province [76]. The studied aquifer (Mukarange) is made of quartzite rocks fused on a schist base with a relatively high yield.

Data Preparation
The state of the input data is one of the key factors that determines the level of accuracy of MLbased predictions. Preprocessing and rectification of the variables ensure that all features receive equal attention throughout the training process [17,59,78]. A total of 759 daily observations from each of the above-mentioned stations were analyzed and prepared before application to the designated task [79]. Water level data were available on 12 h intervals and precipitation and temperature data were on a 24 h basis. In order to set common time intervals, we converted temperature and levels to daily averages. Data preparation was carried out with Python (3.6.6) programming language using Pandas, Numpy, Matplotlib, and SciPy data analysis libraries [80]. The RF, SVR, ANN, KNN, and

Data Preparation
The state of the input data is one of the key factors that determines the level of accuracy of ML-based predictions. Preprocessing and rectification of the variables ensure that all features receive equal attention throughout the training process [17,59,78]. A total of 759 daily observations from each of the above-mentioned stations were analyzed and prepared before application to the designated task [79]. Water level data were available on 12 h intervals and precipitation and temperature data were on a 24 h basis. In order to set common time intervals, we converted temperature and levels to daily averages. Data preparation was carried out with Python (3.6.6) programming language using Pandas, Numpy, Matplotlib, and SciPy data analysis libraries [80]. The RF, SVR, ANN, KNN, and KNN-RF models for prediction of groundwater levels were also realized in Python using the Scikit-Learn machine learning library (version 0.20) [81].
Water level data were available between 3 December 2016, and 30 December 2018. Weather data, including precipitation, temperature, and solar radiation data between 2010 and 2018 were acquired, and weather data between December 2016 and December 2018 that correlated to the water level data were utilized in the experiment. Evapotranspiration is one of the key factors that influence groundwater level oscillations [17]. Since evapotranspiration data were not available, solar radiation data were successfully substituted instead, as suggested in [82,83]. Temperature and groundwater level (GWL) data were converted to mean values to reduce variance among data points as recommended by [84,85]. During the analysis of the data, it was discovered that groundwater data had irregular patterns. Then, the time-series data filtering method was also used to improve the quality of that data. The exponential weighted moving average (rolling mean) produced a superior output of the groundwater level samples. This not only filtered the data, but also revealed long term trends from the data. The exponential rolling mean of a sequence S = (x 1 , x 2 , . . . , x k ), is: where S k is the filtered data, k is the size of S, α is the decay in the interval [0, 1], idc is the initial value of the decay, and x is the input data. As the exponential weighted average is enumerated, the decay α value decreases exponentially in such a way the most recent observations are assigned higher weights than the old ones. For a proper scaling of the time-series data, for each stage, the features (X j ) were converted in the range between −1 and 1 with the formula: where X s stands for standardized value, and X max and X min are the maximum value and minimum value of the features to be scaled, respectively. Despite the effort made to improve the samples, data from two of the observation boreholes (Ruhuha and Rugarama) were found to be unusable. Only data from one borehole (Mukarange) with 759 observations was considered for the current investigation. The groundwater level data were matched with weather data recorded in the same time period ( ). This could be connected to the higher evaporation rate, reduced replenishment, and increased groundwater withdrawals due to excessive temperatures in the studied area. Conversely, higher GWLs are observed during wet periods (March-May and September-December), which can be attributed to increased groundwater restoration and reduced abstractions.

Model Performance and Evaluation Measures
It is important that the prediction model is properly evaluated to assess its performance [87,88]. We estimated the predictive ability of the KNN-RF ensemble model on groundwater levels and evaluated it against SVM, KNN, RF, and ANN models based on mean absolute error (MAE), root mean square error (RMSE), the Nash-Sutcliffe efficiency coefficient (NSE), and the coefficient of determination ( ). MAE, RMSE, and were selected because they limit the bias of models against acute events [88]. In addition, MAE and RMSE provide a finer comparison between models, especially in data-scarce situations [88]. NSE is another efficient coefficient used to gauge the relative magnitude of residual variance against the variance of observational data [46,88,89]. MAE, RMSE, and range between 0 and 1, while NSE is between − and 1. The highest agreement between the estimated and observed values is reached when MAE = 0, RMSE = 0, NSE = 1, and =1. All measurements of the performance of the models were conducted using the hydrostats library (a Python package designed distinctively for hydrology studies) [90]. MAE, RMSE, NSE, and are defined as:

Model Performance and Evaluation Measures
It is important that the prediction model is properly evaluated to assess its performance [87,88]. We estimated the predictive ability of the KNN-RF ensemble model on groundwater levels and evaluated it against SVM, KNN, RF, and ANN models based on mean absolute error (MAE), root mean square error (RMSE), the Nash-Sutcliffe efficiency coefficient (NSE), and the coefficient of determination (R 2 ). MAE, RMSE, and R 2 were selected because they limit the bias of models against acute events [88]. In addition, MAE and RMSE provide a finer comparison between models, especially in data-scarce situations [88]. NSE is another efficient coefficient used to gauge the relative magnitude of residual variance against the variance of observational data [46,88,89]. MAE, RMSE, and R 2 range between 0 and 1, while NSE is between −α and 1. The highest agreement between the estimated and observed values is reached when MAE = 0, RMSE = 0, NSE = 1, and R 2 =1. All measurements of the performance of the models were conducted using the hydrostats library (a Python package designed distinctively for hydrology studies) [90]. MAE, RMSE, NSE, and R 2 are defined as: Hydrology 2020, 7, 59 9 of 24 where Q E is the estimated change in groundwater level, Q A is the actual or observed groundwater level, and n is the total number of input data points.

Methodology
The principal ambition of this study was to propose a decisive model to characterize the seasonal response of the fractured aquifer in eastern Rwanda, through quantification of seasonal deviations in water table depths in data-scarce situations. There is a requirement [19,24,26] for accessible and simple tools that offer actionable insights for the adaptive management of groundwater resources on a seasonal basis. With that requirement in mind, we propose to estimate seasonal groundwater levels using an innovative ensemble KNN-RF model with an exponentially weighted average preprocess of three predictors (solar radiation, precipitation, and temperature). The workflow of predictive modeling and validation setup is illustrated in Figure 6. where is the estimated change in groundwater level, is the actual or observed groundwater level, and is the total number of input data points.

Methodology
The principal ambition of this study was to propose a decisive model to characterize the seasonal response of the fractured aquifer in eastern Rwanda, through quantification of seasonal deviations in water table depths in data-scarce situations. There is a requirement [19,24,26] for accessible and simple tools that offer actionable insights for the adaptive management of groundwater resources on a seasonal basis. With that requirement in mind, we propose to estimate seasonal groundwater levels using an innovative ensemble KNN-RF model with an exponentially weighted average preprocess of three predictors (solar radiation, precipitation, and temperature). The workflow of predictive modeling and validation setup is illustrated in Figure 6. Potential features are selected from the collected data. Solar radiation, precipitation, temperature, and GWL data are refined and scaled for proper format and, subsequently, the candidate machine learning models are chosen. The two models (RF and KNN), both of which are non-complex and capable of working with both small and big datasets [66,70,91], are combined to overcome the disadvantages of small datasets as well as enhance predictive accuracy. In the final step, the models are tested and compared against NSE, MAE, RMSE, and using both estimated and observed groundwater data. A rolling window testing and validation method is employed and the most effective and useful model is determined based on the performance contrasts. This is most suited for the temporal nature of time series data [92], since the size of training and validation sets are sampled with respect to the desired forecast length (corresponding to 15, 30, 60, and 90 days) at the end of the series. A model that can be easily adopted using the available information, particularly in resource scarce areas, will be most feasible and practical for sensible decision making on groundwater resources. Potential features are selected from the collected data. Solar radiation, precipitation, temperature, and GWL data are refined and scaled for proper format and, subsequently, the candidate machine learning models are chosen. The two models (RF and KNN), both of which are non-complex and capable of working with both small and big datasets [66,70,91], are combined to overcome the disadvantages of small datasets as well as enhance predictive accuracy. In the final step, the models are tested and compared against NSE, MAE, RMSE, and R 2 using both estimated and observed groundwater data. A rolling window testing and validation method is employed and the most effective and useful model is determined based on the performance contrasts. This is most suited for the temporal nature of time series data [92], since the size of training and validation sets are sampled with respect to the desired forecast length (corresponding to 15, 30, 60, and 90 days) at the end of the series. A model that can be easily adopted using the available information, particularly in resource scarce areas, will be most feasible and practical for sensible decision making on groundwater resources.

K-Nearest Neighbor
K-Nearest Neighbor is a simple and robust method for regression and classification. Based on the proximity to the training data points, untrained data point(s) can be approximated utilizing the KNN method [91,93]. While attained observations are incomplete and noisy, the KNN technique is one of the best methods for ML-based forecasting [33,34]. Using this method, the most influential areas can also be identified from noisy samples. For continuous data, matching of points is done with respect to the distance measured using either Minkowski, Euclidean, Chebyshev, or Manhattan metrics. Suppose, there are two sets, X and Y, and each of them has t number of items, such that X = (x 1 , x 2 , . . . , x t ), and Y = (y 1 , y 2 , . . . , y t ) so long as ( j = 1, 2, 3, . . . , t). Then the distance between the desired data point and the nearby points can be enumerated. The distance between the desired point to the closest points is then defined as:D where r is a positive real number, andD is the calculated distance.

1.
To anticipate the target value, we perform the following steps: 2.
Use Equation (8) to calculate the distance between a new sample and each of the adjacent points.

3.
Sort all values calculated in step 1 by increasing order.

4.
Utilize the greedy search technique to determine the optimal value of K, based on RMSE.

5.
Enumerate an inverse distance weighted mean using K neighboring examples. 6.
Return average as the approximated value.
In the above scheme, K is a user-configurable parameter that represents the number of contiguous features to be included in the calculation of average votes. The prediction of variations in groundwater levels is obtained as the average weighted distance between samples.

Artificial Neural Network
The artificial neural network is a combination of multiple interconnected neurons that learn cardinal relationships in a set of data in the same way as the human brain operates [37,94,95]. Interconnected neurons make an input layer, one or more hidden layers, and an output layer. As input data are fed through the input layer, neurons in the hidden layer(s) compute the output using connection weights and bias. One of the two stages in which ANNs are used is in the training phase. In this phase, a training algorithm such as conjugate gradient momentum, Levenberg-Marqardt, backpropagation, Adam, gradient-descent, or Bayesian regularization is selected and the suitable connection weights are determined. The feed-forward network was trained using the back-propagation method to avoid an over-fitting issue. Another stage is the real application of the trained neural network. The estimated value(s) is then obtained as: where x i is the input examples, Y e is the approximated output, b is bias term, f is an activation function, and w i is the weight of the vertices. In a three-layered multilayer perceptron network, the transformation of the weighted inputs is accomplished using a rectifier linear unit (ReLU), which is defined as: where x r symbolizes the transformed input passing in the hidden layer, and x i is the raw input. All x i values greater than zero are mapped to their respective y-values, while all x i values less than zero are assigned to zero. This makes the ReLU computational modest and able to efficiently handle negative inputs, and also offers smoother optimization [96].

Support Vector Machine
The support vector machine for regression problems is termed as a support vector regression (SVR). SVRs are supervised learning techniques introduced by Cortes and Vapnik [97]. These are powerful methods that utilize structural risk minimization to obtain optimal solutions [97,98]. SVR accomplishes risk minimization measures using a set of several input vectors while conducting an estimation of non-linear targets through regression processes [98,99]. Based on the assumption that, there is a relationship between the dependent variable y and independent variables (X 1 , X 2 , X 3 , . . . , X t ), the SVR model estimates a function f (x), which determines the target values y i plus the admissible error . In an SVR model, data processing is conducted in a hyperplane and it starts as a linear transformation of the time-series. The linear representation of an SVR algebraic function is given by: where b R, x, w x, and w.x are the bias, inputs signals, weight vector, and the dot product between x and w, respectively. Then to minimize the norm w 2 = w.x , we need to find the smallest possible value of w as: subject to: As the primary goal is to get a function f(x) that can be used to calculate a set of the observed x i and the estimated y i values with the level of accuracy bounded by , this is realized by minimizing a regularized risk function in (Equation (13)) with constraints stipulated in inequalities given below (Equation (14)). Two relaxed variables (ξ, ξ * ) are incorporated in (Equation (15)) to allow for some error tolerance.
subject to: where t is the total number of model input features, and C is a user-configurable parameter that manages the influence of each supporting vector in the generalization and stability of the SVR model. Ultimately, Equation (12) can be reformulated as: where k x, x i is the kernel, and α i , α i * are Lagrangian multipliers. Before applying SVR in data processing, the appropriate kernel and support vectors should be determined. The SVR kernel provides mapping of the non-linear features in a high dimensional space while converting it into a normal linear format. Thus, SVR suits well for complex interrelationship among features in environmental modeling [59]. There are several types of kernels, such as polynomial, multi-layer perceptron, exponential, and Gaussian radial basis function. Radial basis essence has shown a commended performance in hydrologic studies [99,100]. The exponential radial bias function is given by:

Random Forest
Based on decision trees with the application of bootstrap aggregation, random forests (RF) for regression problems and classification were introduced by Breiman [101]. A forest of diverse trees is developed using randomly chosen features selected from random subsets of the original training data. As a large number of trees is produced, classification results are obtained from the popular class, while in regression problems, the result is computed as the average value obtained from all the individual regression trees [102,103]. In the current study, we focus on the regression type of RF. A forest may contain several trees as specified by the user. Suppose the number of trees in the forest is denoted by M. The random forest method works in the following manner:

1.
Randomly fetch different subsets x i from a given dataset X.

2.
Use sampled data to create M decision trees.

3.
Enumerate average of the votes from the decision trees.

4.
Return the average as the final approximated value.
Randomness is applied at two levels of the random forests: during data selection and in attribute selection. Since the regression trees are created from random vectors selected from training dataset X, each leaf-node contains a constant estimate of Y. As an example, the data points (x 1 , y 1 ), (x 2 , y 2 ), . . . , (x M , y M ) are selected as samples for the leaf-nodes, and the anticipated data can be modeled as the averaged predictions from all the individual regression trees as: such that j = 1, 2, . . . , M, and whereŷ x j is the estimated result. In RF, tuning-parameters have a great effect on the ability of the model [78]. The most important tuning parameters for RF in Scikit-Learn are n_estimators, random_state, n_jobs, min_sample_leaf, and max_features [78].

KNN-RF Ensemble
Focusing on the improvement of seasonal predictions, a hybrid KNN-RF technique is developed and validated. As indicated in the Introduction section, the RF and KNN methods have good data-representation ability; however, these methods do not perform optimally when fed with tiny datasets. To overcome this limitation, we merge the above models in a hybrid manner. The two base regressors, KNN, and RF are fitted on the whole training set and, using the test set, the models yield predictions individually. The results are then averaged to produce the final result. The final result of the KNN-RF ensemble is given by: where µ(x) is the final weighted average result of the ensemble model, ω n is the weight allocated to the n th regressor, which is based on the MSE performance; p n is the prediction from n th model; x is the sample data points. The ensemble based on the KNN-RF method enhances predictive performance in the following aspects [104].

1.
Supports using fewer samples to adequately represent data distribution.

Controls variance in a small dataset. 4.
Relieves the processing burden for model selection.
Uniform weights are assigned to all estimators in the KNN-RF model. The base RF model is set to perform bootstrapping on the training subset, which reduces similarities in the trees. This, therefore, benefits the performance of the model provided that a small number of training examples are accessible. The KNN base model is set to use the distance between data points as the proximity criterion. Tuning parameters for KNN-RF are leaf_size, metric, random_ state, n_jobs, n_neighbors, p, and weights.

Tuning Parameter and Input Selection
Based on the performance gains of the models, the appropriate tuning parameters are selected to establish the proper architecture for each of the model during the training phase. Each input arrangement was assessed for the following parameters in Scikit-Learn. For KNN, the Chebyshev, Minkowski, Euclidean, and Manhattan length metrics were tested. The Minkowski measure emerged as the best choice. [105]. The ideal value of K was found using a grid-search procedure [106], and it varied with the sample size, which was determined by the prediction range (more details about the portions of the sample used for the adjustments and testing of the models are given in the next subsection). For ANN, the trial-and-error technique was used to determine the finest number of hidden layer neurons based on the least RMSE [53,107,108]. Fourteen hidden neurons produce the best output. The adaptive learning (Adam) optimization scheme was found to be the most suitable for the dataset used in the current investigation [109]. The ReLU, linear, and tanh activation functions were tested and resulted in ReLU enumerating the most precise results [96,110].
Considering the SVR, the most appropriate values of the epsilon, cost, gamma, and the kernel (rbf, poly, sigmoid) were verified using the trial-and-error technique [53] producing the best factors of 0.01, 1.0, scale, and RBF, respectively [111]. For RF and KNN-RF, the ideal number of estimators is enumerated using the grid search procedure [112]. The number of learners affects the processing speed of the model. Whilst a large number of learners improves the reliability of the model, it also slows down processing speed [78]. The ideal number of estimators is 200, max_depth of the trees is 15, leaf_size is 30, max_feature is n _feature, min_sample_leaf is 1, random_state is none, n_jobs is 1, and min_split is 2. Similarly, the n_neighbors is 3 when the prediction period is 15 or 30 days, and 2 when the prediction period is 60 or 90 days, p is 2, metric is Minkowski, random _ state is 0, and weights is distance.
For proper and comparable evaluation of the models, the input series of the time lagged precipitation, groundwater level, temperature, and solar radiation were arranged in twelve combinations: P (t − 1) L

Training and Testing of the Model
When it comes to the training and evaluation of the models, time series predictive modeling has numerous distinctive traits and peculiarities that need a different approach to supervised learning problems [92]. There are intrinsic interrelationships between the data points measured across time, and during the training and testing of the models, the temporal structure of the series needs to be maintained. Typically, the arbitrary splitting of the timeseries dataset from different points in time is irrelevant to time-based data because it causes inherent biases [92]. The rolling window or walk-forward validation is best suited for time series-based forecasting as it facilitates updating of the predictions as new data come in. In this approach, the holdout values are sampled at the end of the dataset temporally. Figure 7 shows a graphic demonstration of the rolling windows validation procedure. The size of the holdout sample is determined by the prediction scope and, therefore, the width of the rolling window is equal to the desired forecast length. The training and validation of the KNN-RF, ANN, SVM, RF, and KNN models was conducted with the rolling window technique. It was completed using four different portions of training and holdout values corresponding to the prediction of 15-day-ahead (t + 15), 30-day-ahead (t + 30), 60-day-ahead (t + 60), and 90-day-ahead (t + 90). The training and validation percentages for these prediction horizons were 88.1423 − 11.8577, 92.0949 − 7.9051, 96.0473 − 3.9527, and 98.0237 − 1.9763, respectively. The trained models that were used for prediction are explained in the next subsection.

Prediction of Seasonal Changes in Groundwater Depths
In this work, seasonal forecasting is the forecast of 90 days lead-time groundwater level variations. For comparison, other prediction periods of 15, 30, and 60-day were also evaluated. As previously explained, the 15, 30, 60, and 90 days predictions were implemented by changing the size of the hold-out sample.

Experimental Results and Discussion
The predictive capacity of the KNN-RF technique was investigated and the results were compared to the four general models. The results of the 15, 30, 60, and 90 days lead-time groundwater level predictions at the Mukarange borehole using the KNN-RF, RF, SVM, ANN, and KNN models are presented in Figure 8. According to Figure 8, at all horizons the KNN-RF model achieved the best performance with respect to ￼NSE, MAE, RMSE and values. For this model, the RMSE values range between 0.0030 and 0.0035, while NSE values were between 0.913 and 0.9741 during the validation stage. The KNN model obtained the best results for the short term (15-30 day-ahead), while RF obtained improved accuracy for long-range (60-90 day-ahead) estimations. The SVR model tried to catch up with the long changes of the levels and outperformed the ANN model. Similar findings were reported in the studies that compared the above methods for the modeling of groundwater tables [50,113,114]. At all lead times, the ANN method overpredicted the observed values. The low performance of the ANN method in training and testing phases on small-sized samples could be attributed to the data requirements of this model [115]. Compared with the RF, ANN, and SVM models, the KNN model had higher performance scores. This is in contrast with the outcomes reported by Rahmati et al. [116]. Meanwhile, RF is found to be superior to the SVM model, which is consistent with the conclusion made by Naghibi et al. [117]. The size of the holdout sample is determined by the prediction scope and, therefore, the width of the rolling window is equal to the desired forecast length. The training and validation of the KNN-RF, ANN, SVM, RF, and KNN models was conducted with the rolling window technique. It was completed using four different portions of training and holdout values corresponding to the prediction of 15-day-ahead (t + 15), 30-day-ahead (t + 30), 60-day-ahead (t + 60), and 90-day-ahead (t + 90). The training and validation percentages for these prediction horizons were 88.1423 − 11.8577, 92.0949 − 7.9051, 96.0473 − 3.9527, and 98.0237 − 1.9763, respectively. The trained models that were used for prediction are explained in the next subsection.

Prediction of Seasonal Changes in Groundwater Depths
In this work, seasonal forecasting is the forecast of 90 days lead-time groundwater level variations. For comparison, other prediction periods of 15, 30, and 60-day were also evaluated. As previously explained, the 15, 30, 60, and 90 days predictions were implemented by changing the size of the hold-out sample.

Experimental Results and Discussion
The predictive capacity of the KNN-RF technique was investigated and the results were compared to the four general models. The results of the 15, 30, 60, and 90 days lead-time groundwater level predictions at the Mukarange borehole using the KNN-RF, RF, SVM, ANN, and KNN models are presented in Figure 8. According to Figure 8, at all horizons the KNN-RF model achieved the best performance with respect to NSE, MAE, RMSE and R 2 values. For this model, the RMSE values range between 0.0030 and 0.0035, while NSE values were between 0.913 and 0.9741 during the validation stage. The KNN model obtained the best results for the short term (15-30 day-ahead), while RF obtained improved accuracy for long-range (60-90 day-ahead) estimations. The SVR model tried to catch up with the long changes of the levels and outperformed the ANN model. Similar findings were reported in the studies that compared the above methods for the modeling of groundwater tables [50,113,114]. At all lead times, the ANN method overpredicted the observed values. The low performance of the ANN method in training and testing phases on small-sized samples could be attributed to the data requirements of this model [115]. Compared with the RF, ANN, and SVM models, the KNN model had higher performance scores. This is in contrast with the outcomes reported by Rahmati et al. [116]. Meanwhile, RF is found to be superior to the SVM model, which is consistent with the conclusion made by Naghibi et al. [117]. From Tables 2 and 3, the highest accuracy of the KNN-RF for different horizons was achieved with precipitation (P-1), solar radiation S(t), temperature T(t), and groundwater level L(t) time-lags in the testing phase. The most accurate outcomes are shown for the prediction at 15-days ahead. It is also seen that the accuracy of the forecasted results declines with the length of the prediction. These results are in corroboration with the studies in [58,60]. Conversely, the 90-day prediction obtained better results than 60-day prediction. The largest difference between the MAE and RMSE values is perceived for the 60-day predictions. The criterion values showed higher importance for longer lead-times, while NSE provided the overall description of the predictive power of the models. T represents temperature, L represents groundwater level, S represents solar radiation, and P represents precipitation. The highest and lowest MAE and RMSE are in bold.  From Tables 2 and 3, the highest accuracy of the KNN-RF for different horizons was achieved with precipitation (P−1), solar radiation S(t), temperature T(t), and groundwater level L(t) time-lags in the testing phase. The most accurate outcomes are shown for the prediction at 15-days ahead. It is also seen that the accuracy of the forecasted results declines with the length of the prediction. These results are in corroboration with the studies in [58,60]. Conversely, the 90-day prediction obtained better results than 60-day prediction. The largest difference between the MAE and RMSE values is perceived for the 60-day predictions. The R 2 criterion values showed higher importance for longer lead-times, while NSE provided the overall description of the predictive power of the models. Figure 9 delineates the results of the comparison of the relationships between the actual and KNN-RF estimated groundwater levels for different horizons. These results show that there is high association between the actual and estimated levels for all four-time horizons. The 15-day range exhibits the largest value of R 2 , since most of the predictions are closer to the straight line. It was also found that the 60-day prediction range showed the relatively lowest value of R 2 compared to other ranges. This also supports the results presented in Figure 8, which showed higher error values for the 60-day prediction stage than those of the other predictions. Similarly, the hydrographs in Figure 10 show that the KNN-RF technique reproduced and fairly represented the seasonal oscillations of the depths of the groundwater tables. However, it is quite obvious that the 60-day prediction is not as accurate as the predictions for other time-horizons.   T represents temperature, L represents groundwater level, S represents solar radiation, and P represents precipitation. The highest are in bold. Figure 9 delineates the results of the comparison of the relationships between the actual and KNN-RF estimated groundwater levels for different horizons. These results show that there is high association between the actual and estimated levels for all four-time horizons. The 15-day range exhibits the largest value of , since most of the predictions are closer to the straight line. It was also found that the 60-day prediction range showed the relatively lowest value of compared to other ranges. This also supports the results presented in Figure 8, which showed higher error values for the 60-day prediction stage than those of the other predictions. Similarly, the hydrographs in Figure  10 show that the KNN-RF technique reproduced and fairly represented the seasonal oscillations of the depths of the groundwater tables. However, it is quite obvious that the 60-day prediction is not as accurate as the predictions for other time-horizons.  The results obtained have also demonstrated the significant role of input and tuning parameter selection. The combinations of solar radiation, temperature, precipitation, and previous groundwater level with an appropriate time-lag improved the seasonal estimation of the groundwater heights. It was found that for the KNN-RF technique, the lagged precipitation improved NSE, MAE, RMSE, and scores, for 32.6%, 53.19%, 51.57%, and 27.38%, respectively. This suggests that there is a huge potential for the infiltrated and percolated rain water to raise the groundwater table for the Mukarange aquifer.
For feature selection, Lindsey et al. [82] and Kelly et al. [83] showed that the use of solar radiation as a substitute for evapotranspiration as a suitable option for capturing the dynamics of groundwater depth. Our results confirmed the high influence of solar radiation on the long-term variability of groundwater levels in the semi-arid area.
Results also confirmed that tuning parameters have a great influence on the model's final results. These parameters led to an improved generalization capability for all models. Considering the SVR, it was found that RBF yielded the best performance, while epsilon and gamma are the most influential parameters for the determination of the appropriate architecture of the SVR model. This is in corroboration with the findings in [59]. With six (6) input features, the finest number of nodes in the The results obtained have also demonstrated the significant role of input and tuning parameter selection. The combinations of solar radiation, temperature, precipitation, and previous groundwater level with an appropriate time-lag improved the seasonal estimation of the groundwater heights. It was found that for the KNN-RF technique, the lagged precipitation improved NSE, MAE, RMSE, and R 2 scores, for 32.6%, 53.19%, 51.57%, and 27.38%, respectively. This suggests that there is a huge potential for the infiltrated and percolated rain water to raise the groundwater table for the Mukarange aquifer.
For feature selection, Lindsey et al. [82] and Kelly et al. [83] showed that the use of solar radiation as a substitute for evapotranspiration as a suitable option for capturing the dynamics of groundwater depth. Our results confirmed the high influence of solar radiation on the long-term variability of groundwater levels in the semi-arid area.
Results also confirmed that tuning parameters have a great influence on the model's final results. These parameters led to an improved generalization capability for all models. Considering the SVR, it was found that RBF yielded the best performance, while epsilon and gamma are the most influential parameters for the determination of the appropriate architecture of the SVR model. This is in corroboration with the findings in [59]. With six (6) input features, the finest number of nodes in the hidden layer of the ANN was found to be 14, which is consistent with the conclusion reached by Kayzoglu et al. [118]. Whilst the adaptive learning scheme and the ReLU function are commonly used with large datasets, our results suggest that these parameters can also work well with a limited dataset. One of the possible reasons for this outcome is the sparsity of the available samples. We found that limiting the number of learners to 200 and the depth of the trees to 15 had positive effects on the generalization ability of the RF model and overcame the overfitting issue on the Mukarange dataset. The best results for the KNN-RF, SVM, and ANN methods were achieved with the structures presented in Table 4. In general, the KNN-RF model has experimentally demonstrated its superiority to classical models. For example, despite the sparseness of the available dataset, multi-step prediction results obtained by the KNN-RF model outperformed SVR, KNN, RF, and ANN scores. The researchers recognize some limitations to the approach applied in this study as, for example, the size of the dataset might not properly represent the distribution of climatic and groundwater data.

Conclusions
The performance and capacity of the ensemble KNN-RF regression approach to the prediction of seasonal groundwater levels for the fractured aquifer with limited data has been examined in this study. Groundwater level data and its significant meteorological drivers (solar radiation, temperature, and precipitation) collected from Mukarange in eastern Rwanda were used for the analysis. From the experimental analysis, it was found that the KNN-RF ensemble approach is stable with enhanced generalization competence and prediction accuracy. The results also indicated that, by using the sliding window validation procedure, the KNN-RF model captured slightly well with the time-based changes in the depths of the groundwater tables. Inclusion of the solar radiation as a substitute for evapotranspiration resulted in an improved prediction accuracy. The results of the study suggest that KNN-RF is well-suited for the forecasting of seasonal variations in groundwater depths with limited samples. The values of the analytical measures showed that, in all prediction ranges, the KNN-RF technique achieved the most promising results compared to those obtained by the ANN, KNN, SVM, and RF models. The NSE and R 2 values of the ensemble KNN-RF technique were higher than those yielded by the above methods, and the values of RMSE and MAE of the ensemble KNN-RF were smaller than those produced by the other methods. The research used data from one groundwater observation station over a short duration, and it has been concluded that more data could improve the predictive accuracy of the model. This can be achieved simply and effectively by updating the model as data become available, since a sliding window method has been used. In addition, the KNN-RF model was shown to be an advanced alternative to the SVR, KNN, RF, and ANN models. The results from this study would be useful for the planning and management of groundwater resources. Our proposed model could be readily transferable or adapted to other areas, specifically those with similar aquifers where the availability and quantity of data is challenging. In order to address data scarcity issues, the authors intend to establish a low-cost, low power wireless sensor network for near real-time monitoring of groundwater levels in eastern Rwanda.  Acknowledgments: Data used in this study were collected from two sources. Groundwater level data for Mukarange, Ruhuha, and Rugarama aquifers in Eastern Province in Rwanda were collected from RWFA website (available at https://waterportal.rwfa.rw) in comma-separated file format. Groundwater data were sampled twice a day. Climatic data for Nyagatare, Kibungo, and Kawangire in Eastern Province in Rwanda were obtained from Meteorological Agency of Rwanda (Meteo-Rwanda) in comma-separated file format. Finally, the authors gratefully acknowledge the constructive suggestions given by the anonymous reviewers and the editor.

Conflicts of Interest:
The authors declare that there is no conflict of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: