Big Data Analytics and Its Role to Support Groundwater Management in the Southern African Development Community

Big data analytics (BDA) is a novel concept focusing on leveraging large volumes of heterogeneous data through advanced analytics to drive information discovery. This paper aims to highlight the potential role BDA can play in improving groundwater management in the Southern African Development Community (SADC) region in Africa. Through a review of the literature, this paper defines the concepts of big data, big data sources in groundwater, big data analytics, and big data platforms and frameworks, and describes how they can be used to support groundwater management in the SADC region. BDA may support groundwater management in the SADC region by filling in data gaps and transforming these data into useful information. In recent times, machine learning and artificial intelligence have stood out as novel tools for data-driven modeling. Managing big data from collection to information delivery requires critical application of selected tools, techniques and methods. Hence, in this paper we present a conceptual framework that can be used to manage the implementation of BDA in a groundwater management context. We then highlight challenges limiting the application of BDA, which include technological constraints and institutional barriers. In conclusion, the paper shows that sufficient big data exist in the groundwater domain and that BDA methods are available for use in the groundwater sciences, thereby providing the basis to further explore data-driven sciences in groundwater management.


Introduction
Big data analytics is a relatively new term describing the use of advanced and traditional analytical techniques to leverage vast quantities of heterogeneous data, in order to provide valuable insights that can be used to propel optimization, development and knowledge discovery [1,2]. To date, the surge of data from online social media activities, internet activities, business transactions, scientific missions, digitization and sensor technologies, among many others, has benefited many industries in understanding their operational environment. Collectively, these data are referred to as big data. For instance, some healthcare institutes now readily utilize data from electronic patient records, physician notes, medical equipment and social media to predict the outcome of treatments, the onset [...] These issues hamper the sustainable management of groundwater in the SADC region. In this case, big data may provide a useful tool to fill data gaps by exploring, collecting and integrating various sources of groundwater big data (both conventional and unconventional). Big data analytics methods can, in turn, transform these data into valuable groundwater information to support sustainable groundwater development in the SADC region. Big data analytics may also provide the opportunity to address scale issues, where methods can be employed to downscale regional groundwater data to local conditions in support of localized groundwater management (e.g., individual boreholes, wellfields). In many groundwater management scenarios, a local scale analysis is preferred [14]. For SADC groundwater, investing in big data may make it possible to harness data from a multitude of new sources, to improve monitoring by using new sensor technologies, to centralize data storage and management and to apply new advanced analytics that can uncover new patterns and trends to drive knowledge discovery.
The aim of the paper is to highlight the potential role big data and big data analytics can play in supporting groundwater management in SADC. To that end, we present the current state of the art of big data and big data analytics in the groundwater discipline and explain how they can be applied to groundwater management in the SADC region. The vast spectrum of data sources, analytics tools and technologies is challenging to navigate while trying to ensure data integrity and accuracy. For this purpose, specialized big data analytics frameworks are employed to facilitate the management and application of big data analytics. Drawing on the findings of the review, we therefore propose a novel conceptual big data analytics framework designed to address the challenges of groundwater management in the SADC region. This paper provides a foundation for the application of big data analytics in the groundwater discipline, in particular for problem-solving applications in the SADC region, paving the way for further work into data-driven sciences.

Research Approach
The information presented in this paper is based on a review of relevant literature. By summarizing and extracting critical information from key research in the big data, big data analytics, water resources and groundwater discipline, we establish the current state of knowledge on big data in the groundwater discipline. For example, we first define what big data are (Section 3.1), then describe where big data comes from in the groundwater sciences in general (Section 3.2) and then describe the various big data analytical methods that are relevant (Section 4). Thereafter we present a brief review of the big data analytical framework (Section 5). Where appropriate, a relation is made to a SADC setting, in order to contextualize the review. The findings of the review are then used to facilitate the development of a proposed conceptual framework for SADC (Section 6). Finally, we end the paper with a discussion of the expected challenges facing the application of big data analytics in a SADC context. Figure 1 illustrates a road map of the paper and how the findings of each section relate to the framework.
The relevant literature was sourced through keyword and phrase searches in popular web-based search engines and literature databases, such as Google, Google Scholar, SpringerLink, Scopus and Mendeley. Keywords and phrases used to search for relevant literature include, but are not limited to, "big data", "big data analytics", "big data and groundwater", "big data analytics and groundwater", "big data and water" and "big data analytics and water". A total of 135 papers are cited based on the review process.

Big Data: Concepts and Role in Groundwater Science
In this section, we present the landscape of big data in the groundwater discipline. Drawing from the literature, we define what big data are and we introduce the various sources of big data relevant to the groundwater discipline including what these big datasets are composed of and how they relate to support groundwater management in SADC.

Defining Big Data
Big data are collections of very large datasets with a great diversity of types, which makes them difficult to collect, store and analyze with conventional tools and techniques [15,16]. Big data have a few characteristics that separate them from datasets that are merely large. These characteristics are recognized as the Vs of big data [17]: volume, meaning that big data consist of enormous quantities of data, generally beyond a threshold of one terabyte, although this threshold changes with time, sector, data type and use case; velocity, meaning that big data are generated at an exceptionally high rate, such that their volume increases rapidly over time; and variety, meaning that big data are composed of many different data types from many different sources [17].
The three Vs (volume, velocity and variety) are the commonly defined features of big data, and were first coined by [18]. Since then, industry experts have added further Vs to define big data. For example, IBM added veracity, which describes the inherent inaccuracy and uncertainty present in most large and complex datasets [19]. SAS introduced variability and complexity, which describe the ever-changing nature of big data over time with respect to velocity and variety [17,20]. Oracle introduced value as an additional V, which stipulates that big data must contain new knowledge or improve operational efficiency for them to justify financial investment [17,20]. This value is usually achieved through the use of analytics, which transform the raw data into useful information. For SADC groundwater to realize the value of big data, thought must be given to understanding the Vs in the context of groundwater big data in Southern Africa, as well as the analytics required to turn these data into useful information for groundwater management.
Big data types play a role in how big data are managed from data to information. They can be broadly categorized into structured and unstructured data [17]. Structured data are any type of data that can easily be stored, categorized and referenced in tabular form. The main tool to store, access and query this type of data is through relational databases, making them easily readable by machines [20]. For example, conventional hydrological data generated through in situ monitoring commonly constitute point information that can easily be captured in relational databases and conventional spreadsheets. This is typical of structured data.
On the other hand, text, video, audio and images are examples of unstructured data. These lack higher structural organization and are not easily stored in relational databases [20]. For example, videos of flooding events or social media posts related to various aspects of water and groundwater constitute unstructured data relevant to groundwater. In addition, remote-sensing images constitute unstructured data, but the metadata attached to the image are structured [5,21]. Unstructured data are particularly difficult for machine programs to extract information from, at least with traditional techniques. Semi-structured data have some form of structure; however, this structure tends to be very irregular and often heterogeneous, which makes categorization challenging. Emails and XML files fall into the semi-structured data type [17,20,22].
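To make the three categories concrete, the following minimal Python sketch handles one example of each. All records, field names and the social media post are hypothetical and were invented for illustration; they are not data from the reviewed literature.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: tabular rows with a fixed schema (hypothetical borehole readings).
structured_csv = (
    "borehole_id,date,water_level_m\n"
    "BH-001,2020-01-15,12.4\n"
    "BH-002,2020-01-15,30.1\n"
)
rows = list(csv.DictReader(io.StringIO(structured_csv)))
levels = [float(r["water_level_m"]) for r in rows]

# Semi-structured: tagged but irregular; the second record has no <level> element.
semi_xml = (
    "<readings>"
    "<reading id='BH-001'><level>12.4</level></reading>"
    "<reading id='BH-003'><note>dry</note></reading>"
    "</readings>"
)
root = ET.fromstring(semi_xml)
xml_levels = {
    r.get("id"): (float(r.findtext("level")) if r.find("level") is not None else None)
    for r in root.iter("reading")
}

# Unstructured: free text; a value can only be recovered by pattern matching.
post = "Our village borehole dropped again, water now about 18 m down."
match = re.search(r"(\d+(?:\.\d+)?)\s*m\b", post)
text_level = float(match.group(1)) if match else None
```

The structured rows can be read with a fixed schema, the XML needs a guard for missing elements, and the free text needs a pattern that may well fail on other phrasings, which mirrors the increasing difficulty described above.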

Sources and Nature of Big Data in Groundwater Sciences
The previous section introduced the characteristics that define big data in general. Intrinsically, these are also the characteristics that make big data difficult to leverage with traditional information systems alone [23]. In this section, we define the sources and nature of data in the groundwater domain, within a big data context.
A common awareness among data scientists is that not all big data are the same and that the structure and nature of big data and how we analyze them depend on the domain [5]. For example, geospatial data differ from text data (such as from social media posts) and the techniques and tools used to collect, store and analyze each of these types of data will be different [15]. The result is that one needs to fully understand the specificities of the relevant data sources and what information is required from these data before appropriate big data tools, techniques and analytics can be applied.
Data in the groundwater domain has not been static. Over the years, groundwater scientists have explored various sources to collect groundwater data. Table 1 illustrates these sources of data relevant to groundwater. Table 1 includes the traditional sources of groundwater data such as in situ observations or hydrogeological maps, as well as modern data sources such as remote sensing, social media or Internet of things (IoT). Individually, some of these sources may not have the characteristics of big data, but when harnessed together they provide some substantial opportunities for knowledge discovery. Large scale data assimilation models are one example of such systems that incorporate data from different sources, such as field activities, remote sensing and computer simulations. However, at the moment they do not ingest data from unconventional big data sources, such as social media [24].

Field Activities and Historic Sources of Data in Groundwater
In the groundwater sciences, one of the primary sources of data is observations collected during field operations. These activities include drilling operations, pumping operations and monitoring operations. Drilling operations collect data on geological and hydrogeological properties of the aquifer, such as lithology and water strikes. Pumping operations collect data on hydraulic properties of the aquifers, such as yield. Field-based hydrological monitoring operations typically involve the selection of sampling sites (in a hydrogeological context these are mostly boreholes, piezometers and springs) and the collection of in situ point data through the use of various techniques and instrumentation [25]. These data are considered direct observations and are thus typically robust in terms of accuracy. In addition, these data represent local conditions within an aquifer, and are thus preferred for groundwater management. Drilling and pumping operations tend to be occasional activities, while field monitoring data collection is generally carried out on a quarterly basis but may be even less frequent. In modern times, the use of sensors installed at sampling sites has increased the frequency at which observations are recorded. In some cases, these sensors are connected to remote monitoring centers, allowing off-site data collection. However, this is not the norm across SADC member states. In the SADC region, there are generally two challenges that limit the ability of these data to support groundwater management. The first challenge is that data collection from field activities is generally sporadic. For example, field monitoring in SADC has been curtailed because the number and distribution of sampling sites have generally decreased over the years [26]. The result is a limited network of sampling sites that are actively monitored on a regular basis. This has led to a data record across SADC that is sparsely populated, both temporally and spatially. The second challenge is that data storage is disparate and uses various formats. For example, some countries store data in centralized databases, while others only store data on spreadsheets or in hardcopy form [26]. This challenge ultimately affects data retrieval and sharing.
Nowadays, data from this source may be stored in databases, digital spreadsheets or GIS files. However, in the past, the results of field activities were recorded in reports and on physical maps. These historic data exist either in hardcopy form or as scanned documents. Often, these sources of data lie idle in archives, as digital forms are preferred. However, through a process called optical character recognition (OCR), written text can be converted into machine-readable characters [27]. Similarly, computer vision applications combined with deep neural networks have shown potential to transform raster maps (images) into vector data [28]. Digitizing and transforming these sources of information into machine-readable data can create a new stream of big data [29].

Remote-Sensing Big Data
Field monitoring hydrological data do not necessarily constitute big data, in the ontological sense of the word [2]. These data are easily managed and analyzed by standard information systems. When looking for big data sources for groundwater, remote earth observation systems or remote-sensing data are the obvious candidates. Remote-sensing data truly are big data, constituting highly dimensional, highly heterogeneous and increasingly voluminous datasets [5]. Remote-sensing data constitute all data collected from ground, airborne or spaceborne earth observation instruments. Remote sensing for earth observation started in the late 1950s with the launch of the Sputnik 1 satellite [30]. Since then, hundreds of earth observation satellites have been launched, some specifically to collect data on Earth's hydrological systems, such as Landsat or the Gravity Recovery and Climate Experiment (GRACE) [31]. Some of these remote-sensing missions, such as Landsat, have been active since the early 1970s [30]. Over the years, new remote-sensing missions have been undertaken and advanced, which has contributed to an ever-increasing big dataset. For example, NASA's SMAP mission collects 458 GB of soil moisture data every day [32]. Table 2 illustrates some of the remote-sensing products that are relevant to groundwater. For a more detailed description of the missions, refer to [32,33]. These remote-sensing missions generally provide gridded data products with global coverage, including the SADC region. In SADC, where local in situ monitoring data are scarce, remote-sensing data can fill the gap, providing better temporal and spatial coverage. Remote-sensing big data also have the potential to provide the spatial and temporal coverage needed to close terrestrial water budgets [34][35][36][37], although uncertainty in sensor estimates and over-simplification of water budget models have often led to erroneous results.
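For readers unfamiliar with the term, closing a terrestrial water budget is commonly expressed with a balance of the following generic form. The symbols are introduced here for illustration and are not notation taken from the cited studies: $S$ is total terrestrial water storage, $P$ precipitation, $ET$ evapotranspiration and $Q$ runoff. The second line shows the widely used way of isolating a groundwater storage change $\Delta GW$ from a GRACE-type total water storage change $\Delta TWS$ by subtracting changes in soil moisture ($\Delta SM$), snow water equivalent ($\Delta SWE$) and surface water ($\Delta SW$).

```latex
\begin{align}
  \frac{\Delta S}{\Delta t} &= P - ET - Q \\
  \Delta GW &= \Delta TWS - \left( \Delta SM + \Delta SWE + \Delta SW \right)
\end{align}
```

Each term on the right-hand side carries its own sensor or model uncertainty, which is one reason why budgets assembled purely from remote-sensing products can fail to close.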
On the other hand, one challenging aspect of remote-sensing data for groundwater management is the coarse spatial resolution of the data. Hydrological investigations using remote-sensing data have generally been carried out at regional or global scales. This is because much of the remote-sensing data are at a spatial scale that does not support local or site-specific analysis (Table 2) [33]. This is especially true for GRACE data, which have a spatial resolution of 110 km. At this scale, many of the smaller transboundary aquifers would be contained in one GRACE pixel or overlap only a few. This hinders their applicability to local scale use. In fact, most of the studies done using GRACE data have focused on regional scale investigations [38][39][40][41][42]. In order to be applicable to local scale groundwater management, the resolution of GRACE data must be refined. Big data analytics offers methods for downscaling remote-sensing data to support local groundwater management.
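One common family of downscaling approaches is regression-based: a relationship between the coarse variable and a predictor is fitted at the coarse scale and then applied to the predictor at fine scale. The sketch below is a minimal illustration of that idea only, not the method of any cited study. The function names, the choice of a single linear predictor and the toy numbers are all assumptions.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def downscale(coarse_pred, coarse_target, fine_pred_by_pixel):
    """Regression-based downscaling sketch (hypothetical helper).

    coarse_pred        : predictor averaged over each coarse pixel
    coarse_target      : coarse-pixel value to downscale (e.g. a storage anomaly)
    fine_pred_by_pixel : for each coarse pixel, the predictor at its fine cells
    """
    a, b = fit_line(coarse_pred, coarse_target)
    fine = []
    for cp, ct, cells in zip(coarse_pred, coarse_target, fine_pred_by_pixel):
        # Residual correction: re-add what the regression misses at coarse
        # scale, so the fine cells average back to the coarse value.
        residual = ct - (a + b * cp)
        fine.append([a + b * v + residual for v in cells])
    return fine


# Toy example: three coarse pixels, two fine cells each (all values invented).
coarse_pred = [10.0, 20.0, 30.0]
coarse_target = [-2.0, 1.0, 3.0]
fine_pred = [[8.0, 12.0], [18.0, 22.0], [25.0, 35.0]]
fine_target = downscale(coarse_pred, coarse_target, fine_pred)
```

Because each coarse predictor here equals the mean of its fine cells, the downscaled cells average back to the original coarse value. In practice the predictor, the model form and the validation against in situ borehole data would all need careful justification.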
Ground-based and airborne geophysical surveys also contribute to remote-sensing data. Geophysical surveys are a broad category of observational techniques that can be used to collect data on aquifers and groundwater properties. Active geophysical methods rely on generating an artificial energy field, such as an electromagnetic field, and recording its interaction with water or rock interfaces. Passive geophysical methods rely on measuring natural fields of the Earth, such as the magnetic field, at various locations and inferring rock or water properties from these observations [43]. Geophysical methods are numerous and include ground penetrating radar, electric resistivity and seismic reflection/refraction, among many others [44]. Geophysical survey methods allow data collection at greater spatial scales than in situ point observations, but smaller than satellite-based remote sensing. However, they are expensive, and they are generally only performed during groundwater exploration exercises. Thus, these types of data are not encountered frequently in SADC, but are available for some transboundary aquifers in SADC, such as the Zeerust/Lobatse/Ramotswa dolomite aquifer [45].

Simulated Hydrological Data
In this section, we discuss hydrological data generated through computer models or through reanalysis applications. In essence, these datasets represent synthesized data, generated through numeric methods and data assimilation techniques. The data available through these sources are comprehensive, providing detailed spatiotemporal data on numerous hydrological variables.
In this category of big data, one source stands out as being especially extensive: the results of atmospheric models. This is a broad category of numeric weather and climate models that are used to predict future weather and climate patterns in the short and long term and at the regional or global scale [46]. It includes models such as global circulation models (GCM), regional climate models (RCM) and numeric weather prediction (NWP) models. Some of the most advanced atmospheric models, such as GCMs, are often coupled with land-surface, sea-ice and ocean circulation models and can only be run on powerful supercomputers [47]. Hence, the amount of data processed and generated by these models is enormous. These data are often made available to the general public for free or through various paid license agreements. For example, the European Centre for Medium-Range Weather Forecasts (ECMWF, Reading, UK) disseminates much of its data via its website (https://www.ecmwf.int/en/forecasts/datasets). Of particular relevance to hydrological sciences is the forecast data for precipitation, which may be useful for understanding future trends in groundwater resources in the SADC region.
Land-surface modeling applies complex mathematical equations to integrate hydrological, biologic and radiation-based energy exchange processes at the land-surface, between the land surface and the atmosphere and within the soil-column [48]. These models assimilate an extensive array of both in situ and remote-sensing-based observational data to derive natural fluxes at the earth surface [49]. For example, the land data assimilation system from NASA provides numerous datasets on various hydrological variables, such as evapotranspiration, soil moisture and run-off, on a global scale (visit https://earthdata.nasa.gov/ for data retrieval). Datasets from these systems are particularly useful for hydrological applications, providing data for an integrated systems analysis.
Lastly, reanalysis datasets provide an additional trove of historical data, which are useful to understand past trends in natural earth systems. Reanalysis data refer to original in situ observational datasets that have been reanalyzed and amended using data assimilation techniques and are generally the by-products of land-surface models and atmospheric models [50]. Examples of reanalysis datasets are ERA5 from ECMWF, NCEP/NCAR Reanalysis I from the National Centers for Environmental Prediction and the National Center for Atmospheric Research (NCEP, College Park, MD, USA; NCAR, Boulder, CO, USA), and the Japan Meteorological Agency's JRA-55 (JMA, Tokyo, Japan), which are easily retrievable and widely used in hydrological applications [50]. Although these datasets are primarily geared towards atmospheric sciences and land-surface states, many of the parameters included in the datasets are correlated with groundwater processes (e.g., stream discharge, soil moisture), making them valuable data sources.

Social Media and the Web Data
With the advent of the Internet, a new channel for communication and transfer of information was created. Today, almost all industries and individuals rely on the Internet in some way. It is no surprise then that the volume of data being generated and transmitted over the Internet is both enormous and complex. Of concern to groundwater is all the hydrological information being transmitted over the Internet, which is not already stored in specialized data repositories. This means information present on webpages and social media threads, among others.
Social media provide an unconventional new big data source in groundwater sciences. It may be hard to visualize how unstructured social media data may be useful in a conventional sense, but these data types make up the majority of big datasets [17]. With modern advanced analytical techniques such as natural language processing or video analytics, valuable information can be extracted from these data sources [51,52]. For example, [53] demonstrated a framework to infer actual levels of rainfall from the contents of Twitter feeds. This study used certain word and phrase combinations commonly appearing in tweets as a rainfall magnitude reference. Statistical learning was then applied to model and forecast the magnitude of a rainfall event based on the wording in Twitter posts and the actual rainfall amount. In addition, [54] showed the production of real-time flood extent maps from live Twitter feeds in Jakarta, Indonesia. In this case, Twitter posts that contained geo-located information on water depth and extent were used to infer near real-time flood extent maps by combining the data with digital elevation models using a flood-fill algorithm. Not only do these data represent local conditions, they are also streamed in real time and have real-world applications in supporting disaster relief and risk management efforts [1,55]. These examples clearly demonstrate the potential value of unstructured data, specifically from social media, in hydrology-related applications.
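The flood-fill idea mentioned above can be sketched in a few lines. This is a generic breadth-first flood fill over a small elevation grid, written for illustration; it is not the implementation used in [54]. The grid, the seed location and the reported depth are invented. The assumption is that a geo-located post gives a water depth at one cell, which fixes a water surface elevation, and that flooding spreads to 4-connected neighbours lying below that surface.

```python
from collections import deque


def flood_extent(dem, seed, depth):
    """Return the set of (row, col) cells flooded from a single report.

    dem   : 2D list of ground elevations (m)
    seed  : (row, col) of the geo-located report
    depth : reported water depth at the seed cell (m)
    """
    rows, cols = len(dem), len(dem[0])
    water_surface = dem[seed[0]][seed[1]] + depth
    flooded = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in flooded and dem[nr][nc] < water_surface:
                flooded.add((nr, nc))
                queue.append((nr, nc))
    return flooded


# Invented 4 x 4 elevation grid; cell (2, 3) is low but cut off by higher ground.
dem = [
    [5, 5, 5, 5],
    [5, 1, 2, 5],
    [5, 2, 9, 1],
    [5, 5, 5, 5],
]
extent = flood_extent(dem, seed=(1, 1), depth=1.5)
```

In this toy grid the low cell at (2, 3) stays dry because it is not connected to the seed through cells below the water surface, which is the behaviour that distinguishes a flood fill from a simple elevation threshold.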
However, from a groundwater perspective in the SADC region, data from social media platforms or other similar data conduits, may have limited value. These types of applications work well in developed areas, with a large number of users and sufficient Internet access. In the less developed urban areas and rural settings of the SADC region, the spatial coverage of this type of data may be limited [56]. In addition, it is very difficult to visualize groundwater from the surface, as it is hidden below layers of soil and rock. Thus, it remains to be seen whether social media related groundwater data are prevalent and quantifiable in countries within the SADC region.

Internet of Things Data
According to [20], an estimated 20.8 billion connected devices were expected to exist by 2020. Connected devices are electronic equipment that can connect with each other and with various digital systems over the Internet [57]. These devices include objects such as smartphones, sensor equipment or even household appliances. Some of these objects are continually streaming environmental data. For example, [58] demonstrated the use of atmospheric pressure and temperature data collected through a smartphone application to improve near real-time weather predictions. Similarly, [59] showed the advantages of using smartphones and connected personal weather stations to monitor weather patterns in Amsterdam. The real-time spatial and temporal distributions of data from these sources allow a level of data insight that was not possible before. This is the realm of the Internet of things (IoT).
The application of IoT systems to groundwater science can generate large amounts of data on local groundwater conditions, faster than conventional or manual data collection, providing improved management of groundwater resources [60]. For example, real-time IoT groundwater monitoring and data management systems have been piloted in various regions, such as California and India, to improve sustainable groundwater management [61,62]. Sensor equipment is continually decreasing in cost and increasing in accuracy and may certainly improve the data collection capabilities of the groundwater domain in Southern Africa.
Additionally, citizen-science missions have shown promise in collecting environmental data such as groundwater levels [63]. In fact, virtual citizen science missions have shown to outperform conventional data collection methods, collecting data in a few days that would normally take months [64]. However, the quality of citizen science data is not always of a high standard and proper quality assurance measures must be in place to ensure robust results. By incorporating these technologies and data collection tools into various groundwater-related initiatives, new local scale data can be generated, in some cases in real time. These data can be fed into big data analytical platforms (collections of software and hardware utilities for management of big data), integrating them with other datasets, turning them into useful information to support groundwater management [65].
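As one hedged illustration of what a basic quality assurance step for crowd-sourced or sensor readings might look like, the sketch below combines a physical range check with a robust outlier test based on the median absolute deviation (MAD). The function name, the thresholds and the sample readings are assumptions chosen for illustration; real programmes would define their own acceptance criteria.

```python
from statistics import median


def passes_qa(values, lower, upper, k=3.0):
    """Return one boolean per reading; True means the reading passes both checks.

    lower, upper : physically plausible range for the variable
    k            : number of scaled MADs tolerated around the median
    """
    in_range = [v for v in values if lower <= v <= upper]
    if not in_range:
        return [False] * len(values)
    med = median(in_range)
    mad = median([abs(v - med) for v in in_range])
    # 1.4826 scales the MAD to be comparable with a standard deviation
    # for normally distributed data.
    tolerance = k * 1.4826 * mad
    return [
        lower <= v <= upper and (mad == 0 or abs(v - med) <= tolerance)
        for v in values
    ]


# Invented depth-to-water readings (m) submitted for one borehole.
readings = [12.1, 12.3, 12.2, 55.0, -4.0, 12.4, 12.0]
flags = passes_qa(readings, lower=0.0, upper=200.0)
```

Here the negative reading fails the range check and the 55.0 m reading fails the outlier test, while the remaining values pass. A check like this only screens gross errors and does not replace calibration or expert review.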

Methods in Big Data Analytics
The value of big data is truly realized when they are transformed into useful information. Big data analytics covers a comprehensive package of advanced analytical, statistical, mathematical and graphic methods that can be used to transform the data into useful information [51]. In this section, we discuss big data analytics in more detail, focusing on specificities that are important for transforming groundwater big data into useful information.

What Is Big Data Analytics?
According to [51], big data analytics is advanced analytics operating on big data. Many of the tools and techniques employed in big data analytics, such as machine learning, have been available for many years [66]. It is only recently, with the surge in big data, that the value of these advanced analytical techniques has been realized. Compared to traditional analytics approaches, advanced analytical techniques perform well when dealing with very large, heterogeneous datasets and require less data pre-processing, as shown in Table 3 [67]. For example, machine learning can work on both structured and unstructured data, while traditional analytics works well only on structured data. One of the major differences between traditional analytics and big data analytics is the processing platform required. Big data generally require parallel processing methods to be analyzed effectively. Big data analytics methods are designed to operate over multiple distributed processors, whereas traditional analytics methods are generally designed to operate on single machines [67]. Traditional analytical methods are only efficient when significant sampling and dimensional reduction methods (e.g., principal component analysis, genetic algorithm) are used to reduce data size. In addition, traditional analytics is not suited to parallel processing frameworks. Big data analytics together with traditional analytics may allow us to leverage various sources and types of groundwater big data, turning them into useful information for a groundwater manager to use.
Table 3. Traditional analytics vs big data analytics (adapted from [67,68]).
Generally, big data analytical techniques include traditional analytics such as data mining, statistical analysis, SQL queries (Structured Query Language queries) and data visualization, which work well on structured data. Advanced analytical techniques such as natural language processing, text analytics, video analytics, audio analytics, artificial intelligence and machine learning work well with heterogeneous unstructured data [17,51]. An assemblage of these techniques is usually used to turn raw big data into information. For example, in shale analytics, a combination of data mining, machine learning, artificial intelligence, correlation analysis and pattern recognition is used to extract information from text reports, sensor data and geophysical surveys from thousands of existing well operations. This information is then used to predict the success of new well operations [9]. In this case, the combination of analytics is specifically designed to extract value from the types of data present in shale gas operations. In order to leverage big data in groundwater in SADC, a similarly tailored set of analytical operations is needed to extract information from the types of data expected. It is also important to note that the type of analytics chosen should address the problem being investigated.
The spectrum of big data analytical techniques is vast and an explanation of all these techniques is beyond the scope of this study. However, understanding the role various big data analytics play in deriving information from data is key to deriving the knowledge required to improve decision-making. For example, Table 4 presents a summary of common big data analytical techniques and the typical methods they include. These techniques can be used for a myriad of tasks such as extracting information from text data (text analytics), video files (video analytics), audio data (audio analytics) and even geospatial data [17]. Hence, data collected from citizen science initiatives, remote-sensing data, social media data and conventional hydrological data can be turned into useful information for advancing understanding in groundwater management.
Generally, the role of big data analytics is to understand historical events or observations (descriptive analytics), what will occur based on historical observation (predictive analytics) and what is the best solution under uncertainty (prescriptive analytics) [69]. Translating this to a groundwater context allows us to understand what the fundamental interrelation and operation of various hydrogeological processes are based on current data (descriptive analytics), using this knowledge to predict future groundwater scenarios (predictive analytics) and then understanding what the best actions are going forward (prescriptive analytics). This is where the paradigm shifts towards emphasis on data-driven solutions, allowing our analysis to be prescribed by trends in the data rather than theory.
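The three roles can be illustrated with a small sketch in Python. All values below are hypothetical, and the prescriptive step rests on a strong simplifying assumption that is not taken from the cited literature: that the rate of water-level decline scales linearly with the fraction of current pumping that is retained.

```python
# Hypothetical monthly groundwater levels (m above a datum) at one borehole.
levels = [50.0, 49.5, 49.0, 48.5, 48.0, 47.5]


def describe(levels):
    """Descriptive: mean change in level per month over the record."""
    return (levels[-1] - levels[0]) / (len(levels) - 1)


def predict(levels, months_ahead, pumping_fraction=1.0):
    """Predictive: extrapolate the observed rate of change.

    ASSUMPTION (illustrative only): the rate of change scales linearly
    with the fraction of current pumping that is retained.
    """
    rate = describe(levels) * pumping_fraction
    return levels[-1] + rate * months_ahead


def prescribe(levels, months_ahead, minimum_level, options):
    """Prescriptive: largest pumping fraction that keeps the predicted
    level at or above the minimum, or None if no option achieves this."""
    feasible = [f for f in options
                if predict(levels, months_ahead, f) >= minimum_level]
    return max(feasible) if feasible else None


rate = describe(levels)                      # what has happened
future = predict(levels, months_ahead=6)     # what is likely to happen
action = prescribe(levels, 6, minimum_level=46.0,
                   options=[1.0, 0.75, 0.5, 0.25])  # what to do about it
print(rate, future, action)
```

A real application would replace the extrapolation with a calibrated or data-driven model; the sketch only shows how the three kinds of question build on one another.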

Statistical Methods
Statistical methods in this case relate to conventional data analysis techniques that have been at the forefront of traditional empirical analysis [72]. These methods are rooted in statistical and mathematical sciences [69]. They are designed to perform functions of association among data points, the segmentation and clustering of data, the categorization of data, anomaly detection, regression and prediction analysis within structured datasets [72]. For example, a multivariate regression analysis can be used to quantify the relationship between a set of variables, which can then be used to predict the outcome of a set of dependent variables. These techniques are still widely employed today in extracting information from groundwater data as well as in modeling groundwater processes (e.g., geostatistics). For example, statistical techniques are used to design groundwater monitoring systems, assess groundwater quality and simulate groundwater flow. However, they suffer from certain drawbacks. Statistical methods are not fully optimized to handle large streams of heterogeneous, highly dimensional and noisy data [17]. Standard statistical techniques are more suited to operate on samples of a population, which are then used to infer across the entire population based on the statistical significance of the results [17]. In contrast, big data analytics operates on the majority, if not all, of the data in the population. Hence, the idea of statistical significance becomes less relevant. Furthermore, these standard statistical techniques are difficult to implement in parallel-processing environments, which is often necessary when dealing with big data [15]. However, incorporating these methods into big data analytics applications may still prove useful in handling traditional structured data on groundwater.

Data Mining
Data mining is a term used to describe the use of big data analytical techniques to extract new information, such as patterns in data, relationships among variables, groupings of closely related data points or prediction of outcomes, from very large datasets [15,69,70]. Data mining involves the use of many statistical and machine-learning methods. Data mining is not restricted to very large datasets, and has been in use since before the advent of big data [72]. Only recently have some of the traditional analytical methods been extended to cope with processing big data. For example, traditional clustering algorithms such as K-means have been extended by partitioning large datasets into samples that can be processed across multiple machines. Results of the samples are combined to represent the overall dataset [15]. Typical algorithms for this approach include the clustering large applications (CLARA) algorithm and the clustering large applications based upon randomized search (CLARANS) algorithm [15]. Data mining is also not restricted to structured data and can be applied in text, image, video and audio analytics, etc. [17,52,73]. Data mining is the cumulative task of transforming the data into useful information and is thus an important step in any big data analytics application.
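The idea of clustering samples separately and then combining the results can be sketched as follows. This is a minimal illustration in plain Python on one-dimensional, hypothetical data; it is not CLARA or CLARANS (which are based on medoids and randomized search), and the partitions are processed in a simple loop rather than on separate machines.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain k-means on 1-D data with a deterministic initialization."""
    ordered = sorted(points)
    n = len(ordered)
    # Start from k evenly spaced order statistics of the data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sorted(centroids)


def partitioned_kmeans(points, k, n_partitions):
    """Cluster each partition on its own, then cluster the pooled centroids.

    Each partition must contain at least k points. In a distributed
    setting every partition would be handled by a different worker.
    """
    partitions = [points[i::n_partitions] for i in range(n_partitions)]
    local_centroids = [c for part in partitions for c in kmeans_1d(part, k)]
    return kmeans_1d(local_centroids, k)


# Hypothetical measurements forming two groups (around 10 and around 50).
data = [9, 10, 11, 10, 49, 50, 51, 50, 9.5, 10.5, 49.5, 50.5]
combined = partitioned_kmeans(data, k=2, n_partitions=3)
print(combined)
```

The combined centroids only approximate those obtained from the full dataset, which is the trade-off such sample-based schemes accept in exchange for scalability.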

Artificial Intelligence and Machine Learning
One of the most common big data analytical techniques employed in the literature is machine learning, which is a branch of artificial intelligence. Machine learning consists of self-learning algorithms which form the backbone of most artificial intelligence programs [69]. Machine-learning methods provide a robust avenue to analyze large, highly dimensional, highly complex nonlinear systems [74]. For example, the complex interactions between various components of hydrological systems in nature are often nonlinear. Modeling these systems using conventional statistical analysis results in simplified and inaccurate outputs. In this instance, machine learning methods are better suited.
Machine learning can generally be classified into three broad learning categories: supervised, unsupervised and reinforcement machine-learning methods [75]. Supervised machine learning requires so-called labeled data that can be used as validation during the training process [76]. Labeled data are data points that have been tagged with known properties for that class of data, for algorithms to learn from. The model calculates expected outputs through an iterative training process (e.g., back-propagation in neural networks), and the algorithm or rules that best predict the labeled output from the input data are retained. In unsupervised machine learning, the model operates on unlabeled data, hence there are no known outputs on which to train the algorithm. In this case, the algorithm finds hidden patterns and groups in the data, for example to perform clustering. Reinforcement learning is intermediate between supervised and unsupervised learning. Instead of using labeled data that provide a correct answer, the algorithm receives only an indication of whether an action is correct or not. These indications are received through reward or punishment signals, respectively [75]. Through this process an algorithm is trained.
Generally, machine learning is used to perform four basic tasks, which include regression, classification, clustering and association [70]. Supervised machine-learning algorithms primarily perform either regression (e.g., linear regression) or classification (e.g., k-nearest neighbor) [76]. Unsupervised machine learning primarily performs the tasks of clustering (e.g., k-means clustering) or association (e.g., the Apriori algorithm) [77]. Reinforcement learning is best suited for determining the best possible actions within an environment based on maximizing the reward. Machine-learning approaches are all data-driven, requiring a large number of data points to achieve realistic accuracy. The benefit is that these models rely on real world data, without making any a priori assumptions about the system. When coupled with logical understanding of processes, these techniques allow new unforeseen relationships to be uncovered.
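As a small illustration of supervised classification with labeled data, the sketch below implements a k-nearest neighbor classifier in plain Python. The water-quality samples, their two features (electrical conductivity and chloride) and the labels "fresh" and "saline" are hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter


def knn_predict(training_data, query, k=3):
    """Label a query point by majority vote among its k nearest labeled points.

    training_data is a list of (features, label) pairs. Features are used
    unscaled here; in practice they would usually be standardized first.
    """
    by_distance = sorted(training_data,
                         key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled samples: (conductivity in uS/cm, chloride in mg/L).
samples = [
    ((300, 20), "fresh"),
    ((450, 35), "fresh"),
    ((500, 40), "fresh"),
    ((2500, 600), "saline"),
    ((3000, 800), "saline"),
    ((2800, 700), "saline"),
]

print(knn_predict(samples, (400, 30)))
print(knn_predict(samples, (2700, 650)))
```

The labels play the role described above: they are the known properties from which the rule (here, proximity to labeled neighbors) is learned.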
Traditionally, physics-based numeric models and conventional statistics have been the pervasive tools to simulate groundwater processes. These models require advanced a priori knowledge of the aquifer as well as complex data on a multitude of aquifer parameters to develop realistic simulations of groundwater processes [78]. This makes numeric models particularly complex to develop. A comparative study among machine learning techniques and numeric models for modeling groundwater dynamics in the Heihe River Basin, North Western China, showed generally favorable results for machine learning compared to numeric models [79]. Hence, machine learning-based groundwater models provide an alternative tool to physics-based process models.
Although the use of machine learning in groundwater is fairly nascent, there are a few case applications which allow us to illustrate its use. For example, groundwater level modeling and forecasting have been accomplished using various machine-learning methods [80][81][82][83][84][85]. Groundwater level forecasting is a particularly useful application area for machine learning, and it provides predictions of future groundwater levels to aid groundwater management. Similarly, groundwater quality mapping has been demonstrated by [86], using multivariate cluster analysis. [87] demonstrated the use of a boosted regression tree framework to model and predict nitrate concentrations in Central Valley aquifer, California, USA. [88] explored the use of various machine learning algorithms to predict groundwater recharge. Machine-learning algorithms have even been used to map surface water bodies from Landsat images [89], as an example of image analytics in hydrological sciences. The benefit of machine learning and artificial intelligence is in its ability to describe and predict real world scenarios, as well as to prescribe the best actions for a desired outcome. This feature may be key in developing data-driven solutions to support groundwater management.

Uncertainty Analysis
One of the most important concepts when collecting and analyzing big data is dealing with uncertainty [71]. In big data analytics, this uncertainty is generally a result of large, highly heterogeneous, multidimensional datasets. These features of big data introduce many unstructured, inconsistent, incomplete and noisy data to the big data analytics process. The collection of data from heterogeneous sources in a variety of formats creates complexity in assuring the quality of data. For example, data from social media posts are not generated through rigorous scientific processes, and should thus be subjected to enhanced data quality measures [90,91]. Failure to address data quality and uncertainty early in the analysis process can create compounding effects across the big data value chain and can ultimately reduce the accuracy of outputs [71].
Additionally, the lack of training and understanding of the nuances within various big data analytical algorithms may lead to erroneous applications [92]. This emphasizes the need to select proper techniques when dealing with big data, which is a statistical skill. Traditionally, mechanisms to deal with uncertainty involve tasks such as outlier detection, removal of duplicates, missing data detection and handling and unifying datasets [92]. However, even with data preparation taking place, there will still be inherent errors in big data that are difficult to detect. [71,93] discussed several strategies for mitigating errors during statistical learning in big data analytics. These include incorporating techniques such as probability theory, Bayesian theory, Shannon's entropy, rough set theory and fuzzy set theory into the big data analytics process.
From a groundwater data perspective, the inclusion of remote sensing and other sensor-based measurements should be done with caution. Previous studies have shown these sources to contain uncertainty associated with various remote-sensing errors [94,95]. Additionally, it is also common for local scale variables such as groundwater levels to be uncertain. This is largely a result of missing data, poor data capturing and measurement errors. Efforts must be put in place to ensure reduction in the uncertainty of collected data, and that methods applied are relevant to the type of data being explored.
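The traditional preparation tasks mentioned above (removal of duplicates, outlier detection and handling of missing data) can be sketched for a single groundwater level record as follows. The record is hypothetical, and the specific choices (a median absolute deviation rule with a threshold of 5, and linear interpolation of gaps) are illustrative assumptions rather than recommendations from the cited studies.

```python
from statistics import median


def clean_series(records, mad_threshold=5.0):
    """Deduplicate, mask outliers and fill gaps in (day, level) records.

    A level of None marks a missing reading.
    """
    # 1. Remove duplicates: keep the first reading reported for each day.
    first_seen = {}
    for day, level in records:
        first_seen.setdefault(day, level)
    days = sorted(first_seen)
    values = [first_seen[d] for d in days]

    # 2. Outlier detection: mask values far from the median, measured in
    #    units of the median absolute deviation (MAD).
    observed = [v for v in values if v is not None]
    med = median(observed)
    mad = median(abs(v - med) for v in observed)
    values = [None if v is not None and mad > 0
              and abs(v - med) / mad > mad_threshold else v
              for v in values]

    # 3. Missing data: interpolate linearly between the nearest valid
    #    neighbors; gaps at the ends of the record are left as None.
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is None:
            left = max((j for j in known if j < i), default=None)
            right = min((j for j in known if j > i), default=None)
            if left is not None and right is not None:
                w = (days[i] - days[left]) / (days[right] - days[left])
                values[i] = values[left] + w * (values[right] - values[left])
    return list(zip(days, values))


# Hypothetical record: a duplicate on day 2, a gap on day 3, a spike on day 5.
raw = [(1, 12.0), (2, 12.2), (2, 12.2), (3, None),
       (4, 12.6), (5, 99.0), (6, 13.0)]
cleaned = clean_series(raw)
print(cleaned)
```

Such rules reduce, but do not eliminate, the uncertainty in the record; the masked and interpolated values should be flagged in the metadata so that later analyses can account for them.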

Visualization Tools
Visualization tools are techniques used to intuitively investigate big data using graphic means [15]. This typically involves the use of graphs, tables, images, diagrams and other ways to display data. These data visualization tools allow for an intuitive view of data, allowing patterns to be discerned based on expert judgment, instead of sophisticated quantitative analysis. For example, this is applicable when dealing with geospatial data, whose properties are reliant on neighboring data points [96]. However, one of the issues with representing big data in a graphic way is that the data are too large and contain too many dimensions to represent fully in graphs and tables. Data scientists must condense the data, through feature extraction or geometric modeling, to properly display them [15].

Big Data Analytics Platforms and Frameworks for Geo-Spatial Data
In a sea of big data tools, techniques and methods, groundwater scientists looking to leverage big data to support groundwater management can become overwhelmed. Big data platforms are enterprise scale solutions used to facilitate the use of big data to meet a specific industry need. They are generally a collection of hardware and software layers, built upon a specific big data processing framework. The function of modern big data platforms is to leverage big data. This is achieved through a process of data acquisition, data storage and preprocessing, data transformation through analytics and information dissemination [97]. Figure 2 illustrates a general reference framework for big data, which includes the typical features or components required for any big data platform.
Figure 2. Big data analytics value chain (adapted from [98,99]).
Data acquisition revolves around connecting to relevant data sources and determining individual data products and ingestion mechanisms. Here, one must consider the type of data being collected (e.g., structured versus unstructured), access and usage protocols for the various sources, the volumes of data required (which influences how data will be transmitted from the source to the processing location) and meta-data generation [100]. For example, the size of some data products makes it impractical to retrieve data from the sources repeatedly for analytical queries. In this case, it may be more advantageous to ingest entire datasets and store them on local systems. The complexities associated with data collection make the data itself an important component of any big data platform.
Data pre-processing focuses on addressing the quality and uncertainty in the data, as well as the conversion of unstructured data to structured data. The purpose of this component is to create analysis-ready datasets. In this step, one must consider the type of data required for analytical operations, data cleaning protocols that are necessary, the uncertainty of the data and the post-processing algorithms that can be applied to improve accuracy in the raw data. The caveats (i.e., limitations and inaccuracies) of individual datasets will be important in this step [101].
Once the data have been preprocessed, data storage can take place. This requires knowledge of how data are to be curated, the type of data being stored (i.e., structured or unstructured), the processing environment required, meta-data and the indexing paradigm. For example, in the Earth Science domain data will most certainly be geospatial in nature, so indexing the data along temporal and spatial dimensions would support faster and more versatile analytical operations [102].
Figure 2 also illustrates how the value of big data increases across the value chain. Big data analytics plays an important role in the value chain, leveraging big data in driving the knowledge discovery process, as we move from raw data to useful information. In this component, many of the analytical methods described in Section 4 will be useful. However, developing data-driven models through machine learning and artificial intelligence is perhaps the current state of the art in terms of extracting value from the data. Descriptive, predictive and prescriptive analytical models, if feasible, can provide additional tools to support groundwater management. For example, descriptive and predictive models may allow simulation of current and future groundwater conditions, while prescriptive models may allow determination of the impact of various management decisions.
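The benefit of indexing stored data along spatial and temporal dimensions can be sketched with a simple in-memory index. This is a toy illustration in plain Python, not the indexing scheme of any platform discussed in this paper; the grid cell size, field names and observations are hypothetical.

```python
import math
from collections import defaultdict

CELL_DEG = 0.5  # hypothetical grid cell size in degrees


def index_key(lat, lon, date_iso):
    """Map a location and an ISO date (or year-month) to a (row, col, month) key."""
    return (math.floor(lat / CELL_DEG),
            math.floor(lon / CELL_DEG),
            date_iso[:7])


def build_index(observations):
    """Group observations by grid cell and calendar month."""
    index = defaultdict(list)
    for obs in observations:
        index[index_key(obs["lat"], obs["lon"], obs["date"])].append(obs)
    return index


def query(index, lat, lon, month):
    """Return the observations in the cell containing (lat, lon) for a month,
    without scanning the full dataset."""
    return index.get(index_key(lat, lon, month), [])


# Hypothetical groundwater level observations.
observations = [
    {"lat": -25.9, "lon": 25.6, "date": "2019-03-15", "level_m": 31.2},
    {"lat": -25.8, "lon": 25.7, "date": "2019-03-20", "level_m": 30.9},
    {"lat": -25.9, "lon": 25.6, "date": "2019-04-15", "level_m": 30.7},
    {"lat": -24.1, "lon": 27.0, "date": "2019-03-02", "level_m": 12.4},
]
index = build_index(observations)
march_hits = query(index, -25.85, 25.65, "2019-03")
print(len(march_hits))
```

Because a query touches only one key, its cost does not grow with the total number of stored observations, which is the property that array-based and grid-based stores exploit at much larger scale.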
Finally, usable information must be disseminated in the form of maps, figures, tables, etc. This information can be used as it is or it can be incorporated into decision support systems, early warning systems or dashboards to facilitate decisions (Figure 2).
Addressing some of the challenges facing groundwater management in SADC may require a holistic solution such as a big data platform. For example, disparate groundwater big data could be centralized, the application of analytics could be simplified with built-in methods and functions, and the information could easily be accessed through web-based services. Hence, big data frameworks and platforms that can be used to implement a big data approach in support of sustainable groundwater management in SADC are reviewed below.
Many of the data sources described in Section 3 house their data in large data warehouses or centers, which are distributed across the globe. These data centers can be accessed through various web-based platforms, such as Earth Explorer (https://earthexplorer.usgs.gov/), EarthData (https://earthdata.nasa.gov/) and ESA Earth Online (https://earth.esa.int/web/guest/data-access). For example, most data generated by NASA missions get stored in distributed active archive centers across the United States, which can be accessed through various web-based platforms and software [103]. However, navigating, extracting and processing vast amounts of remote-sensing data from various data sources to apply to a specific objective, such as to support groundwater management in SADC region, can be technically challenging [33]. Often, specialist skills and tools are required to properly integrate and use the vast volumes of groundwater big data available.
In order to address some of these challenges, many agencies have developed special platforms that can be used to leverage these big data. The Australian Geoscience Data Cube (AGDC) is an example of a purpose-built big data platform that focuses on leveraging remote-sensing big data, particularly Landsat, for Australian geoscience applications [104]. Hence, the platform's data collection, storage and analysis features are tailored toward managing geo-spatial remote-sensing data. For example, data ingestion and preprocessing components focus largely on refining incoming raw data into analysis-ready products before data storage, using standard techniques. Data storage follows a multidimensional data array format with geospatial indexing (Data Cube). The architecture for this system is supported by the National Computational Infrastructure (NCI) Facility and their high-performance computing framework.
EarthServer is a geospatial big data platform that is more generalized and interoperable, by focusing development on open geospatial data standards, such as those provided by the Open Geospatial Consortium (OGC) [105]. The platform is supported by the Rasdaman framework, which is an array-based, fully implemented parallel storage and processing platform. The platform allows various front-end applications to be attached for specific use cases.
IBM's Physical Analytics Integrated Data Repository and Services (PAIRS) is another geospatial big data platform [106,107]. Its focus is largely on facilitating and simplifying the collection, integration, preprocessing, storage, retrieval and analysis of heterogeneous spatial data. Data are collected and preprocessed into analysis-ready products, indexed and stored along a common geo-spatial grid. Frameworks such as Hadoop and HBase support the storage and processing. Unlike the other platforms that focus on raster data, PAIRS provides facilities for unstructured data types such as those from IoT and social media. The unstructured data are transformed and stored alongside the raster data.
The Earth System Grid Federation (ESGF) is a multi-agency, international collaboration focusing on the sharing of climate related data [108]. The design of the ESGF is based on geographically independent data nodes that are built on common infrastructure. The nodes adopt common federation protocols and APIs (Application Programming Interfaces) that facilitate peer-to-peer communication and transfer of data. At the moment the ESGF is not an analytics platform, instead focusing on data indexing and data access.
Besides the aforementioned big data platforms, there are a number of big data geospatial frameworks that can be implemented as geospatial big data processing solutions. These include ST-Hadoop [102], SpatialHadoop [109], Hadoop-GIS [110], GeoWave [111] and GeoSpark [112], among others. These frameworks facilitate the distributed or parallel processing of geospatial big data.

Conceptual Framework
Although many of the platforms and frameworks described here are suited to the management of geospatial data, the unique groundwater management challenges faced by SADC warrant additional features. For example, a SADC framework must include some features that allow local scale data gaps to be filled and allow nonconnected disparate data sources to be easily ingested. Based on the findings of the review work in this paper, a conceptual big data analytics framework is proposed (Figure 3). The framework illustrates the required features of a big data analytics framework that can support groundwater management in SADC. This includes the typical features such as data collection, processing, storage, analytics and information delivery. However, key features that are unique in the context of this paper include the groundwater management scenarios and downscaling. By including these two features, we position the framework to address the specific groundwater management challenges in SADC. These unique features are discussed in more detail in the following sections. The features of the framework are grouped together under the big data analytical infrastructure, which represents the hardware and software stack that needs to be developed to implement the framework. This framework also focuses on the groundwater big data sources, which are themselves an important feature of any big data framework. The framework is not intended to be a schematic architecture for a big data platform, but rather a reference framework that can be used to develop a big data solution for SADC groundwater management.

Groundwater Management Scenarios
According to [113], one critical task that is often overlooked during the application of big data analytics is the establishment of a sound problem definition. Without a clear problem definition, which ultimately informs the type of information required, the process is open to ambiguity. A good problem definition must be easily translatable into a quantifiable feature that can be statistically modeled [114]. Therefore, the framework begins by defining the problem. In our case, we define various groundwater management scenarios as the problems that need to be addressed. These scenarios are adopted from the California Department of Water Resources Best Management Practices for the Sustainable Management of Groundwater [115]. They represent the typical issues facing groundwater managers and can easily be assessed through quantifiable criteria, such as thresholds indicating undesirable conditions. These groundwater management scenarios ultimately dictate the type of data required, the scale of the data, the individual datasets required, the analytics needed and the information output. For example, during groundwater drought, it is important to monitor groundwater storage, in order to avoid issues of reduction in groundwater storage. In this scenario, it may be required to acquire groundwater level data, GRACE and other hydrogeological data. However, the scale issue associated with GRACE data limits its application at the local scale. This means that downscaling may be necessary before any valuable information can be generated. In this case valuable information may be a series of high-resolution groundwater storage maps over time, which may allow addressing the impacts on interconnected surface water. Other possible problems may be saline water intrusion in coastal aquifers or degradation of water quality in urban, industrial and agricultural areas. Land subsidence is another problem that must be considered, especially in karst aquifers. 
Table 5 presents possible big data and analytical solutions to the groundwater management scenarios.
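The notion of a quantifiable criterion can be made concrete with a short sketch. The rule below (flagging an undesirable condition once the water level has remained below a minimum threshold for a set number of consecutive months) is a hypothetical example and is not taken from [115]; the threshold, the persistence requirement and the data are all assumptions.

```python
def flag_undesirable(levels_by_month, minimum_level, consecutive=3):
    """Return the months at which the level has been below the threshold
    for at least `consecutive` months in a row."""
    flagged, run = [], 0
    for month, level in levels_by_month:
        run = run + 1 if level < minimum_level else 0
        if run >= consecutive:
            flagged.append(month)
    return flagged


# Hypothetical monthly groundwater levels (m above a datum).
record = [
    ("2020-01", 31.0), ("2020-02", 29.5), ("2020-03", 29.0),
    ("2020-04", 28.4), ("2020-05", 30.2), ("2020-06", 29.9),
]
alerts = flag_undesirable(record, minimum_level=30.0)
print(alerts)
```

Expressing a management scenario in this form makes clear which data (here, a monthly level series) and which information output (a list of flagged months) the rest of the framework has to deliver.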

Changes in groundwater storage can, for instance, be compared to changes in ground elevation.
As a specific example, Figure 4 depicts a compartmentalized dolomitic aquifer underlying parts of Botswana and South Africa. In this particular aquifer, groundwater over-abstraction has resulted in a significant reduction in groundwater levels and has further reduced groundwater storage [119]. In order to address the groundwater management challenges in this aquifer using big data solutions, one could bring together a number of datasets, such as groundwater level observations (shown by the blue circles overlying the aquifer in Figure 4), GRACE data, precipitation data from remote-sensing sources, abstraction data and other complementary datasets. Together these data can be used to develop data-driven models of spatial and temporal patterns in groundwater storage changes, as well as to predict future changes under current conditions [81]. This information can then be used to better inform intervention strategies to reduce excessive degradation of the groundwater resources in the aquifer.
The framework developed in this study is intended to be a conceptual framework that can be used to support groundwater management in the SADC region using big data analytics. Thus, it does not include technical details on the individual techniques and methods required for each component to function. For example, how best to integrate and connect the various disparate sources of groundwater data in the SADC region and which methods or models are best for transforming the data into useful information are questions that still need further research. However, the framework provides step-wise guidance for the application of big data analytics to different aquifer problems in SADC.

Big Data Analytics Infrastructure
This component of the framework is the main analytics engine that drives the collection, preprocessing, storage, analysis and delivery of groundwater big data and information. For example, consider the data collected to address the challenges in the dolomitic aquifers of Figure 4. Here, the groundwater level observations, GRACE data, precipitation data, abstraction data and other complementary datasets are not expected to be in an analysis-ready form. Data cleaning, reduction of uncertainty and uniform indexing of the data will need to be conducted before the data can be stored for later use. Once the data are in a quality-assured form and stored, they can be transformed into useful information through various analytics, such as downscaling or machine learning models. This information can then be disseminated to relevant stakeholders through information portals, dashboards, reports, maps and other figures.
In a big data context, this type of work would be carried out within a big data analytics platform, such as those mentioned in Section 5, or within a purpose-built platform designed for leveraging groundwater big data in SADC.
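The cleaning and uniform indexing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the record layout, the two date formats, the site identifier and all values are hypothetical, and a monthly site-level key is only one possible choice of common index.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def clean_records(records, date_format):
    """Drop records with missing values and parse the date text."""
    cleaned = []
    for site, date_text, value in records:
        if value is None:
            continue  # data cleaning: skip missing measurements
        cleaned.append((site, datetime.strptime(date_text, date_format), float(value)))
    return cleaned


def monthly_index(cleaned):
    """Average values per (site, 'YYYY-MM') key so all sources share one index."""
    groups = defaultdict(list)
    for site, when, value in cleaned:
        groups[(site, when.strftime("%Y-%m"))].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical raw inputs from two sources with different date conventions
levels_raw = [
    ("BH-01", "03/01/2019", "12.4"),
    ("BH-01", "17/01/2019", "12.8"),
    ("BH-01", "02/02/2019", None),  # missing reading, removed during cleaning
]
rain_raw = [
    ("BH-01", "2019-01-15", 35.0),
    ("BH-01", "2019-02-15", 10.0),
]

levels = monthly_index(clean_records(levels_raw, "%d/%m/%Y"))
rain = monthly_index(clean_records(rain_raw, "%Y-%m-%d"))

# Keep only the keys present in both sources, giving one analysis-ready table
merged = {
    key: {"level_m": levels[key], "rain_mm": rain[key]}
    for key in levels.keys() & rain.keys()
}
print(merged)
```

Joining on a shared key in this way is what allows heterogeneous sources to be stored together and passed to later analytics; a production system would also need to handle units, spatial referencing and uncertainty, which are omitted here.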

Downscaling Methods
Downscaling methods are of particular interest within the context of the proposed framework (Figure 3). The use of remote sensing, atmospheric models and land surface models provides a useful avenue to explore new insights into the characteristics and processes occurring in aquifers. However, these big data sources, in many cases, offer only regional-scale information because of their coarse resolution. In order to improve localized groundwater management (e.g., individual boreholes, wellfields), fine-resolution information is essential. Big data analytics techniques can address the mismatch between the regional-scale data and the local-scale information through the process of downscaling. Downscaling is the process of refining the resolution of coarse-scale data to a finer resolution for local-scale groundwater management.
Generally, there are two approaches to downscaling: dynamical downscaling and statistical downscaling [120]. Dynamical approaches rely on numerical/physics-based models to simulate regional or local variables from global-scale models [121]. Statistical approaches model the empirical relationships between large-scale variables (predictors) and small-scale variables (predictands) [122]. Each of these approaches has its own merits and constraints (Table 6). For example, dynamical downscaling approaches require heavy computational resources and complex data, but can produce physically consistent downscaling results [123][124][125]. Statistical downscaling approaches require low computational resources, are easy to implement and generally require less complex data (i.e., fewer variables) from multiple sources [125]. However, statistical approaches are less suitable for modeling nonlinear relationships between predictors and predictands. Nonetheless, the low computational and data requirements mean that researchers have favored statistical approaches over dynamical approaches [126,127].

| Characteristics | Dynamical Downscaling | Statistical Downscaling |
| --- | --- | --- |
| Execution difficulty | Difficult to execute, requiring heavy computational resources [131] | Easy to execute, requiring less computational resources [128] |
| Data requirements | Requires complex data from multiple sources [125] | Data requirements are generally lower than for dynamical approaches [125] |
| Downscaling consistency | Physically consistent downscaling of climate variables [124] | Can downscale to finer resolutions; however, nonlinear relationships are hard to model [133] |
| Hydrogeological model inputs | Requires extensive a priori knowledge of hydrological processes | Requires limited a priori knowledge of hydrological processes [81] |
| Uncertainties | Uncertainties introduced through model approximation, assumptions and parameterizations [123] | Uncertainties introduced through non-stationarity and high spatial variability between predictors and predictands [123] |
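To make the statistical approach concrete, the sketch below fits a simple linear relationship between a coarse-scale predictor and a local-scale predictand and then applies it to a new coarse value. It is one of the simplest possible instances of statistical downscaling, chosen for illustration only; it is not the method of any study cited here, and the variable names and numbers (a regional storage anomaly and a local borehole level) are hypothetical.

```python
def fit_linear(predictor, predictand):
    """Ordinary least squares fit of predictand = slope * predictor + intercept."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(predictand) / n
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, predictand))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def downscale(coarse_value, slope, intercept):
    """Estimate the local-scale variable from a coarse-scale value."""
    return slope * coarse_value + intercept


# Hypothetical training pairs: regional storage anomaly (cm) observed in the
# same months as a local borehole water level (m)
coarse_anomaly_cm = [-4.0, -2.0, 0.0, 2.0, 4.0]
local_level_m = [10.0, 11.0, 12.0, 13.0, 14.0]

slope, intercept = fit_linear(coarse_anomaly_cm, local_level_m)
estimate = downscale(3.0, slope, intercept)
print(f"slope={slope}, intercept={intercept}, estimated level={estimate} m")
```

A linear fit of this kind inherits the limitations listed in Table 6: it cannot represent nonlinear predictor and predictand relationships, and it assumes the fitted relationship remains stationary over time.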

Challenges in Applying Big Data to Groundwater Management
According to [134], experts face numerous challenges when trying to implement big data analytics, but these can be divided into three broad categories: (1) data challenges, which relate to the nature of big data itself (e.g., volume, velocity and variety); (2) process challenges, which relate to how to capture, integrate and transform data, how to select the right model for analysis and how to provide the results; and (3) management challenges, which cover issues such as privacy, governance, institutionalization and security, among others. These challenges are further exacerbated by the technological limitations of current information systems [23]. In this section (Section 7), we discuss some of these challenges in the context of groundwater big data in the SADC region in Africa, as well as how they would affect the implementation of the framework.
As in other domains, big data in groundwater within the SADC region are expected to have considerable volume, velocity and variety. For example, the data for a single 10° × 10° tile from the MODIS Evapotranspiration dataset for the SADC region can be as large as 20 GB. Multiplying by the additional variables and tiles needed to model a groundwater management scenario across the entire SADC region would result in the dataset growing rapidly. The technological requirements to store and process such large heterogeneous volumes of data often call for dedicated systems beyond the capabilities of conventional desktop systems [23]. In this instance, technologies such as parallel processing infrastructure and clustered computing systems have come to the fore [23]. However, the computational capabilities of many SADC member states may not be advanced enough to facilitate big data approaches. Furthermore, an obvious bottleneck when ingesting huge volumes of data is the high network speed required to move and process big data [135]. This requirement is often not met in less developed African regions, and network access may even be non-existent in rural regions.
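The growth in volume and the network bottleneck described above can be made tangible with a back-of-the-envelope calculation. In the Python sketch below, only the 20 GB per tile figure comes from the text; the number of tiles, the number of variables, the assumption that every variable is of similar size and the link speed are all hypothetical placeholders.

```python
GB_PER_TILE = 20  # upper bound quoted in the text for one 10° × 10° MODIS tile


def estimated_volume_gb(n_tiles, n_variables, gb_per_tile=GB_PER_TILE):
    """Rough data volume, assuming every variable is as large as the quoted tile."""
    return n_tiles * n_variables * gb_per_tile


def transfer_hours(volume_gb, link_mbit_per_s):
    """Hours needed to move the data over a link of the given sustained speed."""
    megabits = volume_gb * 8 * 1000  # decimal gigabytes to megabits
    return megabits / link_mbit_per_s / 3600


# Hypothetical scenario: 12 tiles and 5 variables over a 10 Mbit/s connection
volume = estimated_volume_gb(n_tiles=12, n_variables=5)
hours = transfer_hours(volume, link_mbit_per_s=10)
print(f"{volume} GB would take about {hours:.0f} hours to transfer")
```

Even under these modest assumptions the transfer would take more than a week of continuous download, which illustrates why storage and network capacity are treated here as first-order constraints.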
Lastly, big data management challenges are experienced within a SADC context, especially when dealing with transboundary aquifers. Transparent data sharing across international boundaries is not always welcomed by individual states. Data ownership and data access are often restricted to certain individuals or institutions and come with many caveats for their use [12]. This is certainly the case when sharing or using the data raises security issues. Such institutional barriers may become a roadblock. Furthermore, management practices employed by member states are not always aligned with each other [12]. The consequence is that decisions taken on the basis of the data may be contradictory within transboundary aquifers, ultimately affecting the sustainable management of groundwater.

Conclusions
Groundwater science is generating increasing amounts of data from scientific experiments, sensor arrays, monitoring programs, remote sensing and even social media. Increasing attention is being paid to leveraging these vast volumes of data for new knowledge discovery in groundwater. Improving sustainable groundwater management in SADC is one use case where big data and big data analytics may be useful. The contribution of big data analytics to groundwater management can be two-fold. Firstly, big data analytics can address issues of data scarcity by consolidating data available from different sources, both traditional and unconventional. Secondly, big data analytics can transform data into usable information that can support groundwater management, especially at a local scale. The general consensus in the literature is that big data analytics techniques and methods provide benefits beyond traditional analytics when dealing with large heterogeneous datasets and are particularly useful when performing data-driven modeling. Advanced analytics such as machine learning have shown promise in modeling groundwater processes. However, the choice of data and of analytical techniques to achieve the analysis goal is critical to ensure data integrity and accuracy along the life cycle of the data. Proper management of data and analytical processes is imperative in this case. A conceptual framework was presented that can be used to facilitate the application of big data analytics in the context of SADC groundwater management. This framework considers the required elements in the value chain based on the literature and local experiences in the SADC. Specific big data techniques and methods (e.g., data acquisition, storage, data mining, machine-learning algorithms) can be used to execute the framework and transform data into usable information. However, it is also clear that some challenges will hinder the progression of big data analytics in the SADC. These challenges include a lack of computing infrastructure (e.g., data storage, network speed) and institutional barriers. Nonetheless, it is clear from this research that sufficient data exist and that big data analytics techniques are developed well enough to explore their operational use in the SADC region. Future work should focus on highlighting solutions to these challenges and on experimenting with big data analytics in specific use cases (such as various aquifer settings) in order to continue developing data-driven sciences in groundwater.