Groundwater Level Modeling with Machine Learning: A Systematic Review and Meta-analysis

Abstract: Groundwater is a vital source of freshwater, supporting the livelihood of over two billion people worldwide. The quantitative assessment of groundwater resources is critical for sustainable management of this strained resource, particularly as climate warming, population growth


Introduction
Groundwater is the largest global reservoir of liquid freshwater, and it is under increasing stress due to overdraft [1]. Groundwater is "the water stored beneath earth's surface in soil and porous rock aquifers" [2], and plays a principal role in sustaining ecosystems and producing food across vast areas of arid and semi-arid land globally [3]. Groundwater accounts for around 33% of total worldwide water withdrawals [4], and over two billion people rely on groundwater as their main water source [5]. Over-drafting is causing groundwater levels to drop continuously and dramatically in many regions, leading to a global groundwater crisis [2,6,7].
To address the challenges of sustainable groundwater management, it is crucial to understand the current status of this indispensable resource and to forecast its future states. Numerous mechanistic groundwater models exist, for example those using finite difference and finite element techniques to simulate the dynamic behavior of a groundwater system [8][9][10][11], such as MODFLOW [12][13][14]. Numerous studies have also applied soft computing techniques to groundwater level or contamination prediction, including genetic algorithms (GA) [15][16][17], artificial neural networks (ANN) [18][19][20][21], and adaptive neuro-fuzzy inference systems (ANFIS) [22][23][24][25][26]. Generally, physical and numerical models have been the main tools for modeling and forecasting the groundwater level. However, because these traditional methods rely on various inputs and the underlying mechanisms are usually too complicated to grasp, data-driven approaches have been used in several recent studies [27,28].
In recent years, there has been growing interest in applying machine learning (ML) and data-driven approaches to groundwater modeling [29][30][31][32][33][34][35][36][37][38]. Due to the complex nature of groundwater problems, resolving all governing processes is very difficult, and simulation and prediction models are constrained by numerous simplifications and assumptions and carry significant uncertainties [39]. Black-box models such as ML, which can resolve the nonlinear interdependencies of all influential input variables without complete knowledge of the underlying physical or mathematical processes, are therefore appealing [33,40]. Moreover, novel strategies such as linear stochastic approaches and pre-processing techniques have recently proven promising in groundwater level forecasting [27].
This study systematically reviews the state-of-the-art application of ML methods in the modeling and prediction of groundwater resources. By conducting a rigorous meta-analysis of the aggregated results, this study investigates the suitability of ML models for predicting the quality and quantity of groundwater resources. Although the scope of this systematic review is not limited to any specific characteristic, its focus is on groundwater level prediction, as it is by far the most popular application of data-driven techniques in groundwater studies [21][41][42][43]. This study builds upon previous review articles on the application of ML and deep learning models in hydrology, water resources, and groundwater [44][45][46][47], and addresses the need for a comprehensive, consistent, and systematic meta-analysis of various ML models in studying groundwater. ML studies of groundwater vary heavily in spatial and temporal scale, background meteorology, ML model construction, sample division, and input variables. As a result, predicted groundwater indices, their spatiotemporal resolution, and their forecast lead time vary widely. Consequently, a robust comparison of the performance of ML models in monitoring and forecasting groundwater characteristics can be challenging. A systematic meta-analysis makes these inter-study comparisons possible by communicating through a pooled summary of combined individual study results [48]. The current study fills these research gaps by following the CEBC protocol for conducting a systematic review [49]. According to CEBC, "A Systematic Review is an evidence synthesis method that aims to answer a specific question as precisely as possible in an unbiased way" [49]. We pose the question: how accurately can ML methods model and predict groundwater resources' quantitative characteristics? By answering this question through a meta-analysis, we aim to shed light on the performance of ML methods in groundwater resources studies.

Methodology
Formulating a well-focused and clearly framed question is the first and one of the most important steps in the systematic review process. Without a pre-defined question and inclusion-exclusion criteria, it can be challenging and time-consuming to identify appropriate resources and search for relevant literature. Following the procedure developed by the CEBC, we used a specialized framework called Population, Intervention, Comparator, Outcome (PICO) to form the question systematically and facilitate the literature search [49]. Here, PICO was defined as:


• Population: time series of groundwater resources' quantity or quality characteristics
• Intervention: regression ML algorithms
• Comparator: observation and measurement
• Outcome: predictive capabilities (through quantitative measures of performance like the coefficient of determination)

Using the PICO framework, we designed a search string and used it to search the title, abstract, and keywords of the literature in two online databases: "Scopus" and "Web of Science". We used the same search string for both databases simultaneously to avoid any discrepancies. The literature sample was drawn from English-language, peer-reviewed journal articles and conference proceedings published between January 2010 and September 2020. The search was performed on 21 September 2020. Adopting a high-sensitivity, low-specificity search strategy, we designed the search string to encompass all regression ML methods that have been used in hydrology and hydrogeology, excluding ML methods that are specific to classification. As a result, many articles were initially identified but removed later at the title and abstract screening stages (Figure 1). The search string is presented in the supplementary material. Adding up the records from both databases, a total of 5762 articles were identified as meeting the search string criteria and were stored in reference manager software (Mendeley). Since we used two databases, there were a considerable number of duplicate articles, which we handled with two methods: an automatic duplicate removal process conducted through Mendeley and, to check the reliability of this process, a parallel check in Excel in which records with identical titles and DOIs were removed. We also used the Fuzzy Lookup procedure in Excel to find similar titles (i.e., titles of the same article with different wordings).
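The duplicate handling described above (exact DOI or title matches plus fuzzy title matching) can be sketched in a few lines. This is only an illustration, not the Mendeley or Excel Fuzzy Lookup procedure used in the review: the record keys `title` and `doi`, the function names, and the 0.9 similarity cutoff are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records, title_threshold=0.9):
    """Drop records whose DOI or (near-)identical title was already seen.

    records: list of dicts with hypothetical keys "title" and "doi".
    title_threshold: assumed similarity cutoff, not a value from the review.
    """
    kept = []
    seen_dois = set()
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        if doi and doi in seen_dois:
            continue  # exact DOI duplicate
        title = normalize_title(rec["title"])
        if any(
            SequenceMatcher(None, title, normalize_title(k["title"])).ratio()
            >= title_threshold
            for k in kept
        ):
            continue  # same article with slightly different title wording
        if doi:
            seen_dois.add(doi)
        kept.append(rec)
    return kept


# Made-up records for illustration only.
records = [
    {"title": "Groundwater level forecasting using ANN", "doi": "10.1000/a1"},
    {"title": "Groundwater Level Forecasting Using ANN.", "doi": "10.1000/A1"},
    {"title": "Groundwater-level forecasting using ANNs", "doi": ""},
    {"title": "Predicting aquifer salinity with SVM", "doi": "10.1000/b2"},
]
unique = deduplicate(records)
print(len(unique), "unique records")
```

The pairwise title comparison is quadratic in the number of records, which is acceptable for a few thousand records but would need blocking or indexing for much larger collections.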
After duplicate removal, 3677 records were retained for the next step: title screening. Titles were assessed by two reviewers, simultaneously and independently, who marked each article to be retained or removed. If both reviewers agreed to either keep or remove a specific record, their agreement was the final decision. In case of a conflict, a third reviewer checked the record and made the final decision to either keep or remove it. In the end, 878 records remained for the next round of review, namely abstract screening. The records were divided randomly among 6 reviewers. Each reviewer read the abstracts of the assigned records to decide whether each record met the inclusion criteria. Inclusion criteria for both title and abstract screening were the same and based on the PICO framework. Specifically, a record had to meet the following criteria to be included:
1. The article should present original research on one or more case studies (i.e., aquifers) that employ a regression ML algorithm to predict a specific and measurable aquifer characteristic in different time steps.
2. The article should use a time series of input data to train its algorithm.
3. The article should evaluate the accuracy of the prediction by comparing the ML algorithm outputs with observation.
4. The article should report its goodness of prediction with quantitative measures of performance (i.e., statistical indices).
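The screening logic above (two independent decisions, a third reviewer on conflict, and four criteria that must all hold) can be written down as a small sketch. The function names and the criterion flag names are hypothetical labels chosen for this example; the review itself applied these rules manually.

```python
# Hypothetical flag names standing in for inclusion criteria 1 to 4.
CRITERIA = (
    "regression_ml_case_study",
    "time_series_inputs",
    "compared_with_observation",
    "reports_performance_measures",
)


def meets_inclusion_criteria(record):
    """Return True only if every inclusion criterion is flagged as met."""
    return all(record.get(name, False) for name in CRITERIA)


def screening_decision(reviewer_a, reviewer_b, third_reviewer=None):
    """Combine two independent keep (True) / remove (False) decisions.

    The third reviewer's decision is used only when the first two disagree.
    """
    if reviewer_a == reviewer_b:
        return reviewer_a
    if third_reviewer is None:
        raise ValueError("conflicting decisions require a third reviewer")
    return third_reviewer


# Illustrative record: criterion 4 is not met, so the record is excluded.
record = {
    "regression_ml_case_study": True,
    "time_series_inputs": True,
    "compared_with_observation": True,
    "reports_performance_measures": False,
}
print(meets_inclusion_criteria(record))
print(screening_decision(True, False, third_reviewer=True))
```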
After the abstract screening, 347 records were retained, of which 23 were either not retrievable or not in English in their full-text form, which left us with 324 articles for the full-text screening (Figure 1). Six individuals reviewed the retrieved full texts according to the inclusion-exclusion criteria as the third step of article screening. After the full-text screening, 127 articles were removed based on the exclusion criteria (Figure 1). Eventually, a total of 197 articles remained to be included in the systematic review and investigated in the meta-analysis. Figure 1 depicts the different steps of the systematic review and the number of records in each step.
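As a quick consistency check, the record counts reported for each screening stage can be chained together. The variable names below are illustrative; the numbers are the ones stated in the text.

```python
# Record counts as reported in the text for each stage of the review.
identified = 5762          # records returned by both databases
after_dedup = 3677         # retained after duplicate removal
after_title = 878          # retained after title screening
after_abstract = 347       # retained after abstract screening
not_usable_full_text = 23  # not retrievable or not in English
excluded_full_text = 127   # removed at full-text screening

duplicates_removed = identified - after_dedup
removed_at_title = after_dedup - after_title
removed_at_abstract = after_title - after_abstract
full_text_screened = after_abstract - not_usable_full_text
included = full_text_screened - excluded_full_text

print("duplicates removed:", duplicates_removed)
print("removed at title screening:", removed_at_title)
print("removed at abstract screening:", removed_at_abstract)
print("full texts screened:", full_text_screened)
print("included in the meta-analysis:", included)
```

The derived totals for full texts screened and articles included reproduce the 324 and 197 quoted above.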
Finally, key characteristics of the final papers were extracted in the data extraction stage. A second reviewer also checked a random subset of the included studies to ensure that data had been extracted accurately. All team members involved in the extraction process also acted as second reviewers and were assigned to check the data extracted by other team members to ensure data hygiene and minimize human error. The extracted data then went through data curation.

Statistical Analysis
The number of research articles using ML to predict groundwater characteristics has been growing since 2014 (Figure 2), with a spike in 2017. Out of the 197 articles included in the meta-analysis, 33 (16.75%) were published in 2017. The included records were published in various journals, of which the Journal of Hydrology (10.66%) published the largest share of papers, followed by Water Resources Management (7.11%) and Environmental Earth Sciences (5.08%) (Figure S1 in the supplementary material).
The systematic literature search showed that Iran (24%), India (18%), China (16%), and the United States (10%) had the highest number of articles, respectively (Figure 3). Iran, the leading country by number of articles in this systematic review, also faces a state of water bankruptcy, partly due to anthropogenic depletion of its aquifers (Noori et al., 2021). The list of countries with the highest number of articles also agrees well with the list of countries with the highest dependency on groundwater resources. According to [50], the five nations with the largest estimated annual groundwater extractions in 2010 were India (251.00 km³/year), China (111.95 km³/year), the United States (111.70 km³/year), Pakistan (64.82 km³/year), and Iran (63.40 km³/year). Iran, India, China, and the United States use 87%, 89%, 54%, and 71% of their groundwater extraction for irrigation, respectively (Margat and Van der Gun, 2013). Groundwater depletion due to overdraft, mainly for irrigation, is reported as a worldwide problem. According to the findings of [51,52], Iran, India, China, and the United States are among the countries that rely most on groundwater resources for food production and deal with the consequences of overdraft. Our findings reveal that the hotspots of groundwater consumption and depletion are popular case studies for the application of ML in groundwater modeling and prediction. In total, the articles included in this study were from 28 countries (Figure 3). Moreover, our findings show that the countries with the highest number of articles are the countries suffering from groundwater stress (Figure S7 in the supplementary material). Most of the papers (56%) had a case study with an area of less than 1000 km², followed by study areas between 1000 km² and 2000 km² (22%), and the remaining 23% had a case study with an area of more than 2000 km² (Figure S2 in the supplementary material). Only 6% of the articles studied a confined aquifer, while 5% studied a semi-confined aquifer and 89% worked on an unconfined aquifer or did not mention the type of aquifer in their manuscript. Twenty-seven percent of the articles studied coastal aquifers, and the remaining 73% had a non-coastal aquifer as their case study (Figure S3 in the supplementary material). Because coastal aquifers are prone to seawater intrusion, groundwater salinization is a common problem in them, particularly where excessive groundwater pumping induces a decrease in the piezometric head [53]; therefore, some of the reviewed studies focused on predicting groundwater salinity in coastal aquifers [29,54].
As shown in Figure 4, a high percentage of the reviewed articles are from arid and semi-arid regions of the world, where surface water resources are generally scarce and highly unreliable [55]. Moving from arid to humid regions, the reliability of surface water resources increases and, as a result, the interest in studying groundwater resources decreases (Figure 4). In total, 26 different ML methods were reported in the articles as tools to predict various characteristics of groundwater resources. Among them, ANN, SVM, and ANFIS were the most popular methods with 53%, 16%, and 10% of total records, respectively, while GEP, LR, and GP were applied much less (Figure 5). The employed ANN models had different architectures, but FFNN was the most used (around 66% of records), followed by NARX with 11.3% of records (Figure S4 in the supplementary material). Gradient descent (64.3%), LMA (19.5%), and PSO (5%) were the most used optimization algorithms for training ANN models (Figure S5 in the supplementary material). Most of the papers that used gradient descent mentioned using backpropagation to calculate the gradients for the weights of the network. Seventy-nine records used wavelet transformation along with ML models, of which 54.4%, 13.9%, and 10.1% utilized ANN, ANFIS, and SVM models, respectively (Figure S6 in the supplementary material). According to the studies that used wavelet transformation, determining the appropriate decomposition level is an important step, as it affects the ML models' performance [56,57]. Moosavi et al. (2013) suggest considering the periodicity and seasonality of the data series to determine the appropriate number of decomposition levels. In summary, our meta-analysis shows that FFNN with gradient descent as the optimization algorithm is the most employed ML model to predict characteristics of groundwater resources. Based on its wide use and acceptable performance, it can be inferred that this model structure is a suitable choice for the prediction of groundwater characteristics.
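To make the most common configuration concrete, the following is a minimal sketch of a feed-forward network with one hidden layer, trained by batch gradient descent with gradients obtained through backpropagation. It is not any reviewed study's model: the layer size, learning rate, epoch count, tanh activation, and the synthetic two-input data are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import random


def train_ffnn(X, y, hidden=4, lr=0.05, epochs=1000, seed=0):
    """Train a one-hidden-layer feed-forward network with batch gradient descent.

    Hidden units use tanh, the output is linear, and the loss is the mean
    squared error. All hyperparameter defaults are illustrative assumptions.
    Returns (predict, history), where history holds the loss at each epoch.
    """
    rng = random.Random(seed)
    n, n_in = len(X), len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
             for j in range(hidden)]
        return h, sum(W2[j] * h[j] for j in range(hidden)) + b2

    history = []
    for _ in range(epochs):
        gW1 = [[0.0] * n_in for _ in range(hidden)]
        gb1 = [0.0] * hidden
        gW2 = [0.0] * hidden
        gb2 = 0.0
        loss = 0.0
        for x, target in zip(X, y):
            h, out = forward(x)
            err = out - target
            loss += err * err / n
            # Backpropagation: chain rule from the output back to each weight.
            d_out = 2.0 * err / n
            gb2 += d_out
            for j in range(hidden):
                gW2[j] += d_out * h[j]
                d_hidden = d_out * W2[j] * (1.0 - h[j] ** 2)
                gb1[j] += d_hidden
                for i in range(n_in):
                    gW1[j][i] += d_hidden * x[i]
        # Gradient descent step on every parameter.
        for j in range(hidden):
            W2[j] -= lr * gW2[j]
            b1[j] -= lr * gb1[j]
            for i in range(n_in):
                W1[j][i] -= lr * gW1[j][i]
        b2 -= lr * gb2
        history.append(loss)

    def predict(x):
        return forward(x)[1]

    return predict, history


# Synthetic, already-scaled inputs standing in for two predictors.
X = [[i / 10.0, ((7 * i) % 10) / 10.0] for i in range(10)]
y = [0.8 * a + 0.3 * b for a, b in X]
predict, history = train_ffnn(X, y)
print(f"MSE at first epoch: {history[0]:.4f}, at last epoch: {history[-1]:.4f}")
```

In practice a library implementation with input scaling, a validation set, and early stopping would be used; the point here is only to show where backpropagation supplies the gradients and where gradient descent applies them.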
Sample division into training, validation, and test sets is one of the important factors in designing ML models. Although some researchers divided the data into only training and testing subsets, using three subsets (training, validation, and testing) is generally preferable. In the latter scenario, the testing set is never used in the process of model building, while the validation set helps with fine-tuning the model hyperparameters and even choosing the best model structure. This procedure reduces the risk of over-fitting (i.e., where an ML model "memorizes" the features of the training input data instead of actually "learning") and yields more reliable results, where the ML model shows that it generalizes to new, unseen data.
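A three-subset division can be sketched as follows. The 70/15/15 proportions and the choice to split in time order (so that the test set is the most recent part of the series) are assumptions made for this illustration; the reviewed studies used a range of proportions and strategies.

```python
def chronological_split(series, train_frac=0.70, val_frac=0.15):
    """Split a time-ordered series into training, validation, and test parts.

    The earliest observations train the model, the next block tunes
    hyperparameters, and the final block is held back for the test.
    The default fractions are illustrative, not values from the review.
    """
    n = len(series)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


# Ten years of monthly values (placeholder numbers, not real observations).
monthly_levels = [float(i) for i in range(120)]
train, val, test = chronological_split(monthly_levels)
print(len(train), len(val), len(test))
```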
Cross-validation is another model validation technique that uses a resampling procedure and is especially useful when the sample data are limited. In the cross-validation process, instead of using a fixed test set, the input data are divided into k "folds", and in each training step, one fold is held out as the test set and the model is trained with the remaining data. After training the model, its performance is measured on the unseen test set (i.e., the held-out fold). This process repeats k times, where k is the number of folds, and at the end, the average of the k measures of performance is reported as the final measure of model fitness. According to our meta-analysis, 16.2% of the articles used cross-validation, while 12.4% of records used both cross-validation and sample division strategies. In total, 96.2% of the articles divided their dataset into subsets; around 80% of these articles had only train-test subsets and 20% had a three-subset division. From a data science point of view, the prevalence of two-subset division can be a weakness, especially if the models have been exposed to the validation data before the final model evaluation.
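The k-fold procedure described above can be sketched without any external library. The model here is a one-variable least-squares line, used purely as a stand-in for an ML model; the contiguous (unshuffled) folds, the function names, and the synthetic data are assumptions of this example.

```python
import math


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds covering 0..n-1."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for an ML model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def cross_validate(xs, ys, k=5):
    """Return the mean RMSE over the k held-out folds and the per-fold scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        pred = [a + b * xs[i] for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], pred))
    return sum(scores) / k, scores


# Synthetic data: a linear relation with a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
mean_rmse, fold_rmse = cross_validate(xs, ys, k=5)
print(f"mean RMSE over folds: {mean_rmse:.3f}")
```

For strongly autocorrelated groundwater series, a forward-chaining variant that only tests on data later than the training data may be more appropriate than the plain folds shown here.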
As shown in Figure 6, most of the articles used 70-80% of the data as the training subset and the remainder as the test subset. Similarly, most of the articles with three subsets used 60-70% of the data as the training set and divided the remainder into validation and test sets (Figure S8 in the supplementary material). The input data length, temporal resolution, and number of input categories are other important factors in ML modeling in general and in hydrological studies in particular. To train a reliable data-driven model in groundwater studies, the model needs to be fed with temporally inclusive input data so that it can predict variable geohydrological conditions and learn the seasonality. As depicted in Figure 7, while most of the articles had fewer than 8 input categories, a considerable portion had 3 or 4 input categories. This might have two main reasons: first, in many case studies, many potential variables are poorly measured, and second, increasing the number of input variables can cause unfavorable modeling phenomena such as the curse of dimensionality. Additionally, the use of few input variables to train ML models can imply the efficacy of these models in predicting groundwater characteristics. This is especially important in ungauged regions. The use of ML models in these regions can also be favorable from an economic point of view, since these regions usually rely on agriculture, and an accurate estimation of, for example, the groundwater level using limited input data can assist with more cost-efficient irrigation scheduling. As shown in Figure 8, the length of the input data time series was mostly up to around 12 years and rarely more than 20 years, with very few studies having more than 40 years of input data to train the ML models. The monthly temporal resolution was by far the most popular among the articles (around 65% of the records), followed by the daily resolution with 19.6% (Figure S9 in the supplementary material). This could imply that groundwater data are more readily available at the monthly temporal resolution than at other resolutions. Furthermore, the monthly resolution might be more favorable for large-scale water management stakeholders and policymakers.
Although our research question was not limited to any specific characteristic, we found that most of the research articles using ML algorithms in groundwater studies focused on the prediction of the groundwater level (82.5%). A possible explanation for this large share is that, in practice, the groundwater level is measured more densely than other variables. Moreover, the groundwater level is a continuous variable that can be regionalized through various interpolation methods. In total, 17 groundwater characteristics were predicted using ML in the reviewed articles, with discharge or baseflow (6.1%), groundwater recharge (2.7%), and freshwater-saltwater interface level (2.5%) being the most popular ones after groundwater level (Table 1 and Figure S10 in the supplementary material). Our analysis shows that the most adopted input variables for training ML models to predict the groundwater level were groundwater levels at earlier time steps (26.7%), precipitation (25.1%), temperature (13.6%), and evaporation or evapotranspiration (10.5%) (Figure S11 in the supplementary material). Humidity or moisture (2.2%), river discharge (1.9%), surface runoff (1.8%), pumping data (1.7%), and river stage (1.6%) were other important input variables. Table S1 in the supplementary material presents the percentage of the most employed input variables for other predicted characteristics. S12 in the supplementary material). After training the ML model, 61.3% of the reviewed articles used their model to forecast future states of groundwater resources. Figure 9 shows the relative frequency of the forecast timespan.
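The two most frequently reported inputs for groundwater level prediction, groundwater levels at earlier time steps and precipitation, are typically arranged into a supervised-learning table before training. The sketch below shows one way to do this; the function name, the number of lags, and the use of same-step precipitation are assumptions of the example rather than a convention taken from the reviewed studies.

```python
def make_lagged_samples(levels, precipitation, n_lags=2):
    """Build (inputs, targets) pairs for one-step-ahead level prediction.

    Each input row holds the n_lags previous groundwater levels followed by
    the precipitation at the target time step; the target is the level at
    that time step. n_lags=2 is an arbitrary illustrative choice.
    """
    if len(levels) != len(precipitation):
        raise ValueError("series must have the same length")
    X, y = [], []
    for t in range(n_lags, len(levels)):
        X.append(list(levels[t - n_lags:t]) + [precipitation[t]])
        y.append(levels[t])
    return X, y


# Placeholder monthly values, not real observations.
levels = [10.0, 11.0, 12.0, 13.0, 14.0]
precip = [0.0, 1.0, 2.0, 3.0, 4.0]
X, y = make_lagged_samples(levels, precip, n_lags=2)
for row, target in zip(X, y):
    print(row, "->", target)
```

Note that the first n_lags time steps cannot be used as targets, so longer lag windows shorten the usable record, which matters for the short series common in the reviewed studies.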

Meta-Analysis
As mentioned earlier, more than 82% of the reviewed articles used ML models to predict the groundwater level, and only around 18% focused on other groundwater characteristics. As a result, our meta-analysis is mostly focused on groundwater level forecasting. We also present the outcome of the meta-analysis for other characteristics, where possible. Here, we used violin plots, which show the probability density of the data at different values using a rotated kernel density plot, providing insights into the distribution of the data and facilitating data analysis and exploration [58,59]. In all violin plots, the red dot shows the mean, while the box shows the first, second, and third quartiles, with the middle bar marking the median. Figure 11 shows the results of the meta-analysis on the predictive capability of ML models for groundwater level prediction through various measures of performance. The statistics of these violin plots are presented in Table S2 in the supplementary material. As shown in Figure 11, the meta-analysis confirms the ability of ML models to predict groundwater levels with high accuracy. Table S2 shows the number of reports for each violin plot. For instance, 546 records with an RMSE performance were used to construct the violin plot of RMSE in Figure 11 (mean RMSE of 0.52 m). It should be noted that different papers had various case studies with distinct groundwater levels; therefore, comparing RMSEs might lead to misleading results in some cases. In other words, the variation of the groundwater level in a shallow aquifer is inherently different from that of a deep aquifer. As shown in Figure 11, the R² results from 270 records are promising.
Figure 12 illustrates the results for other characteristics that had enough records (more than 15) to conduct a meta-analysis (Table S3 in the supplementary material). These violin plots show an acceptable accuracy of ML models in predicting a variety of groundwater characteristics. Unlike the groundwater level prediction (Figure 11), these results are from fewer records (Table S3 in the supplementary material); therefore, general conclusions should be drawn with caution. What is evident, however, is the potential of data-driven models to estimate miscellaneous groundwater characteristics accurately with fewer input data and simpler model structures than physical models.
Along with a one-dimensional meta-analysis of the capability of the ML models to predict groundwater characteristics, we categorized the reviewed papers' reports based on different criteria to shed light on different aspects of data-driven modeling in groundwater studies. Figure 13 presents the results for different ML methods and ANN architectures with a threshold of 15 records in each category (also see Table S4 in the supplementary material). The most employed ML methods (e.g., ANFIS, ANN, SVM) have comparable and even similar performance according to the reported statistical measures. However, ANN slightly outperforms the other models in most cases. Generally, it can be inferred that the most influential factor in the performance of ML models in groundwater studies is the quality and quantity of the input data and not the model. Comparing different ANN architectures, we see that NARX outperforms FFNN, but due to the much lower number of records for NARX, this finding is not conclusive, and more investigation is required. Figure 14 contrasts the results by aquifer type, by whether the aquifer is coastal, by whether cross-validation is used, and by sample division scheme (Table S5 in the supplementary material). As Figure 14 shows, results from different aquifer types are comparable and no obvious trend can be found. Although the number of records differs between coastal and non-coastal aquifers, Figure 14 suggests that the model results for coastal aquifers are slightly superior. Moreover, Figure 14 shows that models perform slightly better under sample division without cross-validation. This might be because, in cross-validation, the dataset is divided into different training and test sets multiple times, and the total performance of a model is the average of all individual performances, whereas in classical validation there is only one training set and one test set. Therefore, even one subset with a low performance would decrease the total performance in the cross-validation technique. There is no meaningful trend in the results for different sample division proportions. Figure 15 shows the outcome of the meta-analysis for the input data's temporal resolution, the input variable selection technique, and forecasting for the future (Table S6 in the supplementary material). The daily time series is marginally better than the monthly time series in terms of model accuracy. Studies that used input variable selection techniques had superior results to those that did not. It can be inferred that input variable selection is a useful step in setting up ML models to predict groundwater characteristics. According to Figure 15, there is no meaningful trend in the results comparing papers that forecast the future and papers that do not. Figures S13 and S14 in the supplementary material depict the results of our meta-analysis for other categories and combinations.
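The per-category summaries behind the violin plots (mean and quartiles for each category with enough records) can be sketched as below. This is not the authors' analysis code: the function name, the `(category, value)` record layout, and the exact handling of the 15-record threshold are assumptions, and the example values are made up.

```python
from collections import defaultdict
from statistics import mean, quantiles


def pooled_summary(records, min_records=15):
    """Summarize reported performance values per category.

    records: iterable of (category, value) pairs, e.g. ("ANN", 0.91).
    Categories with fewer than min_records values are skipped; whether the
    review's threshold was inclusive or exclusive is an assumption here.
    """
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    summary = {}
    for category, values in groups.items():
        if len(values) < min_records:
            continue
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[category] = {
            "n": len(values),
            "mean": mean(values),
            "q1": q1,
            "median": median,
            "q3": q3,
        }
    return summary


# Made-up R² values for illustration; a low threshold keeps the example small.
records = [
    ("ANN", 0.90), ("ANN", 0.80), ("ANN", 0.85), ("ANN", 0.95), ("ANN", 0.70),
    ("GEP", 0.75), ("GEP", 0.65),
]
summary = pooled_summary(records, min_records=3)
for category, stats in summary.items():
    print(category, stats)
```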

Opportunities
Advances in ML and AI algorithms (e.g., boosting algorithms and deep learning), alongside exponential growth in the availability of computational resources (e.g., Google and Amazon cloud), provide unprecedented opportunities for breakthroughs in groundwater monitoring and forecasting (e.g., reliable forecasts with longer lead times). Arguably, the most promising opportunity for future work lies in the flexibility of new algorithms to fuse data with widely different spatio-temporal resolutions from various remote sensors, ground observations, and numerical and physics-based models. The new algorithms also allow physics to be included in the traditionally black-box methods (e.g., physics-based AI) and uncertainties to be quantified (e.g., uncertainty-aware AI). Physics-based AI may resolve a longstanding issue: AI methods could not reliably predict or forecast states or outputs that lie outside the bounds of the observed or training data. Reliable AI/ML methods for the prediction of groundwater states should include a combination of initial states (e.g., groundwater level at the current time, snowpack, surface water availability, temperature, wind, cultivated area), sub-seasonal to seasonal forecasts from numerical models (e.g., from the National Oceanic and Atmospheric Administration's Global Forecast System), and large-scale climate signals (e.g., El Niño-Southern Oscillation). The skill of these variables in predicting future groundwater states varies across regions and temporal lags, but our understanding of all these predictors is improving rapidly. Remote sensing, tele-stations, and citizen science are providing an unprecedented quantity and quality of surface observations. There is, however, an opportunity for a significant scientific contribution through developing homogenized, quality-controlled, global products of in situ observations of groundwater states. Numerical weather prediction models are changing rapidly and their predictive skill is improving quickly, but there remain great opportunities in this field to resolve microphysics and improve weather forecasts. In addition, new climate signals are being explored, and important advances in convolutional, geospatial, and memory-enabling ML models are being leveraged to explore the entire sea surface temperature (SST) domain to devise new teleconnections, which were not captured by traditional climate signals that mainly depended on differences in SST in specific zones. Anthropogenic factors (e.g., groundwater pumping and artificial recharge) can also be integrated into ML/AI models of groundwater. Finally, while still in its infancy, Interferometric Synthetic Aperture Radar technology and data for estimating surface elevation changes, when merged with physics-based models of elastic and non-elastic ground deformation, can be used to infer groundwater levels at unprecedented spatial (a few dozen meters) and temporal (a few weeks) scales.

Summary and Conclusions
In this paper, we posed the question: how accurately can ML methods model and predict groundwater resources' characteristics? Questions of this nature require systematic review methodologies with explicit inclusion and exclusion criteria that are developed to identify and analyze the relevant literature. Here, by conducting a systematic literature search on the application of regression ML in groundwater resources studies, we found that:


• Groundwater level modeling and forecasting is the most popular use of ML in the literature.
• Groundwater level at the previous time step and precipitation were the most employed input variables to feed groundwater models.
• Countries with more dependence on groundwater as a freshwater source produced the majority of studies on the application of ML in groundwater modeling.
• Feed-forward ANN with gradient descent as the optimization algorithm is the most employed and effective ML model to predict quantitative characteristics of groundwater. This might be due to the simplicity of this architecture and the availability of models and code.
• A considerable portion of reports used only 3 to 4 input variables to train the ML models. The acceptable accuracy reported from these models can imply the capability of data-driven models to simulate the complicated nature of groundwater resources efficiently and effectively, even with few input parameters.
• The monthly scale is the most employed temporal resolution in time series and, generally, finer temporal resolutions result in higher accuracy.
• Around 10-12 years of data are required to develop an acceptable ML model with monthly temporal resolution.
• Input variable selection is a widely used technique to choose the most appropriate input variables to train the models, and studies that used these techniques outperformed those that did not.
• A high portion of studies use their data-driven model to forecast the future states of groundwater resources.
• RMSE is the most employed measure of performance across different studies and for various characteristics.
• While different ML methods have similar accuracy in predicting groundwater characteristics, ANN is slightly superior to other methods.
• When using traditional sample division without cross-validation, models generally achieve higher quantitative measures of performance. However, results of cross-validation are generally expected to be a more accurate estimate of the true performance of the model, since cross-validation reduces the risk of overfitting and increases the model's generality.
With the groundwater modeling literature expanding rapidly and interest in using ML tools in this area gaining momentum, meta-analyses like our study can help us grasp what we know, what we don't know, and what we need to know. Systematic reviews and meta-analyses such as the present study can augment recent comprehensive reviews on the application of ML in groundwater studies (e.g., [28]). Future systematic reviews and meta-analysis studies can focus on the application of ML models in other areas of water resources, such as streamflow modeling and forecasting, extreme hydro-meteorological events induced by climate change, and fine-tuning the estimation of evapotranspiration and soil moisture along with remote sensing datasets [60][61][62]. Moreover, since hydrological models always deal with inherent uncertainties and ambiguity in model structure, parameters, and input variables, systematic reviews can shed light on the state of the art of uncertainty, reliability, and sensitivity analysis of hydrological models [63][64][65]. Although aggregating results from different studies, as done here, has some obvious shortcomings, doing so can shed light on the subject by generating comprehensive and multidimensional findings. The aggregation of results is a double-edged sword, though, and since each original research article is specific in its methodology, representation, and interpretation of the results, researchers should be cautious in interpreting the aggregated results.

Figure 1. Flow diagram of the systematic review.

Figure 2. Number of research records included in the systematic review based on their date of publication.

Figure 3. Pie chart of the included research articles based on the country of origin.

Figure 4. Reviewed articles' proportion according to the average annual precipitation of their case studies.

Figure 5. The proportion of the reports according to the ML method that they have employed.

Figure 6. The proportion of articles dividing the data into two training-testing subsets.

Figure 7. The proportion of articles according to their number of inputs.

Figure 8. Percentage of the reviewed articles according to the length of the input data time series.

Figure 9. Percentage of reports according to their forecast periods.

Figure 10 presents the percentage of statistical indicators used to measure the accuracy of the ML model of the groundwater level. RMSE (27.4%), NSE (17.8%), the correlation coefficient (14.3%), coefficient of determination (13.7%), and MAE (9.4%) were the most popular measures of performance. RMSE is also the most adopted measure of performance for other predicted characteristics. RMSE indicates the absolute fit of the model to the data and is a suitable measure of performance with the same units as the predicted variable. On the other hand, the coefficient of determination (R²) is a relative measure and does not indicate the absolute precision of the model.
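The measures named above can be computed from paired observed and simulated values as in the following sketch. The formulas are the standard textbook definitions; treating R² as the squared Pearson correlation is an assumption here, since individual studies may define it differently. The example uses a prediction that is offset from the observations by exactly 1 m to show how a relative measure can look perfect while the absolute error is not.

```python
import math


def rmse(obs, sim):
    """Root mean square error, in the units of the predicted variable."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def mae(obs, sim):
    """Mean absolute error, in the units of the predicted variable."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated values."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


# Illustrative groundwater levels in metres; the simulation is biased by +1 m.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated = [2.0, 3.0, 4.0, 5.0, 6.0]
r = pearson_r(observed, simulated)
print("RMSE:", rmse(observed, simulated))
print("MAE:", mae(observed, simulated))
print("NSE:", nse(observed, simulated))
print("r:", r, "R^2 (as r squared):", r ** 2)
```

Here the correlation-based measures stay at their maximum even though every prediction is off by a metre, whereas RMSE, MAE, and NSE all register the bias.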

Figure 10. The proportion of employed quantitative measures of performance.

Figure 11. Quantitative measures of performance for ML models predicting groundwater levels.

Figure 12. Results of meta-analysis for various groundwater characteristics.

Figure 13. Results of meta-analysis for ML models and ANN architectures to predict groundwater level.

Figure 14. Meta-analysis results according to various subcategories in the reviewed reports.

Figure 15. Meta-analysis results for three subcategories in the reviewed reports for groundwater level prediction.

Table 1. Groundwater characteristics predicted by ML models in the reviewed articles.