2.1.1. Data Scheduling and Acquisition
The definition of the conceptual model of the site is a fundamental step for any NBL estimation procedure. It is not an internal component of the implemented procedure, since its definition lies upstream of it, and it is included in the scheme for the sake of completeness. The conceptual model aims at identifying the factors (sources and processes) that determine the distribution, in space and time, of the parameters of interest. It constitutes the cognitive framework of the area and contains interpretative and relational elements that allow for an understanding of the processes at play in the site in relation to the presence of the targeted substances. The conceptual model guides and supports several choices during the NBL determination procedure (e.g., the data grouping, the exclusion of specific observations, the identification of the most suitable statistical indicator, etc.). A sound formulation of the conceptual model cannot disregard information on the geological, geochemical, and hydrogeological nature of the investigated environmental matrices, or on the anthropic pressures that, in the past or present, have impacted the study area. All these details are necessary to ensure that the analyzed data come from homogeneous environmental horizons.
This is the first step of the procedure falling within the guided system. It relies on a web-based geographic information system set up by the Department of Environmental Policies of the Calabria Region to contain the environmental data of the Water Protection Plan. This facility, which provides a validated and constantly updated database, was equipped with specific additional tools to select the data, following appropriate screening criteria, and to make them readily portable into the NBL estimation procedure. In particular, the system is designed to display the regional map of the monitoring wells, each of which contains the time series of the chemical analyses carried out on the collected samples. The system can initially be queried by entering specific spatial filters (e.g., provincial or municipal limits) in order to obtain a first rough delineation of the investigation area together with the sampling stations it includes. Subsequently, the knowledge gained from the conceptual model of the area, together with the use of the overlapping Web-GIS layers containing the geological information of the site or the land use map (location of industrial zones, urban or agricultural areas, position of dumps, etc.), allows the operator to further circumscribe the monitoring points falling within portions of territory with homogeneous characteristics. This operation can be performed by means of a “lasso tool”, with which freehand areas are drawn on the map, and the positions of the monitoring wells inside them, together with their data, are automatically downloaded into an Excel file. This file has a format suitable to serve as the input of the GuEstNBL software (implemented by the Department of Environmental Engineering—University of Calabria and owned by the Calabria Region, Italy). The tabular format of the file allows the operator to open it beforehand, to inspect it easily, and to edit it if necessary. In this way, new data not yet included in the GIS can be added and analyzed together with the pre-existing ones, and the data pertaining to chemical species that show evident correlations with the chemical-physical characteristics of the sampled water can be divided into homogeneous datasets according to these characteristics. This is particularly the case of redox-sensitive elements such as As, Fe, and Mn.
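As an illustration only, the following Python sketch shows how such an Excel export could be loaded and split into homogeneous datasets; the file name and the column names ("well_id", "date", "parameter", "value", "redox_state") are assumptions and do not correspond to the actual GuEstNBL input format.

```python
# Minimal sketch, assuming a hypothetical Web-GIS Excel export.
import pandas as pd

def load_export(path: str) -> pd.DataFrame:
    """Read the exported monitoring data and parse the sampling dates."""
    df = pd.read_excel(path, parse_dates=["date"])
    return df.dropna(subset=["value"])

def split_homogeneous(df: pd.DataFrame, parameter: str) -> dict:
    """Split the records of one parameter into groups with homogeneous
    chemical-physical characteristics (here: an assumed redox classification)."""
    subset = df[df["parameter"] == parameter]
    return {state: grp for state, grp in subset.groupby("redox_state")}

if __name__ == "__main__":
    data = load_export("webgis_export.xlsx")        # hypothetical file name
    iron_groups = split_homogeneous(data, "Fe")     # e.g. oxidizing vs. reducing waters
    for state, grp in iron_groups.items():
        print(state, len(grp), "samples")
```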
2.1.3. Component Separation Method
The methodology follows the principle that the concentration of a chemical species in groundwater results from the combination of a natural and, where present, an anthropogenic component [6]. The natural component is associated with the hydro-geochemical characteristics of the aquifer and with solid-water interaction processes. The anthropogenic component is due to the effects of human activities involving specific substances whose detection has no causal relationship with the characteristics of a given site and the natural phenomena occurring in it [19,20,21,22,23,24].
The data related to the chosen chemical element are displayed in a table in which they appear in chronological order and are associated with their own monitoring well. The software calculates the median values from the available concentration time series at each monitoring well and displays them in another table placed next to the previous one. The observed frequency distribution of the median values is automatically reconstructed and shown in a graph.
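A possible sketch of this step (not the GuEstNBL code) is shown below; it computes the per-well medians and their observed frequency distribution, reusing the hypothetical column names of the previous sketch.

```python
# Per-well medians and observed frequency distribution (illustrative only).
import numpy as np
import pandas as pd

def well_medians(df: pd.DataFrame, parameter: str) -> pd.Series:
    """Median concentration of one parameter for each monitoring well."""
    subset = df[df["parameter"] == parameter]
    return subset.groupby("well_id")["value"].median()

def observed_frequencies(medians: pd.Series, n_classes: int = 15):
    """Observed frequency distribution of the median values."""
    counts, edges = np.histogram(medians.values, bins=n_classes)
    freq = counts / counts.sum()               # relative frequencies
    centers = 0.5 * (edges[:-1] + edges[1:])   # class mid-points
    width = float(np.mean(np.diff(edges)))     # average class width (Δx)
    return centers, freq, width
```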
The frequency distribution of the observed median concentrations is interpreted by means of the following combination of frequency distributions:

$$h(x) = \Delta x \left[ \omega \, f_{\mathrm{NC}}(x) + (1 - \omega)\, t \, f_{\mathrm{IC}}(x) \right]$$

where h(x) are the observed frequencies and x is the concentration of a given environmental parameter. According to Wendland et al. (2005) [6], the first term of the sum is associated with the natural component (NC), described by a log-normal frequency distribution (f_NC), while the second term, represented by a normal distribution (f_IC), constitutes the influenced component (IC). Moreover, Δx is the average width of the classes associated with the observed frequencies, ω is a weight coefficient, and t is a truncation factor (equal to 0.5).
The two probability density functions (PDFs) are both characterized by a standard deviation and a mean, which are estimated by imposing an optimization criterion (a nonlinear least squares method) within a calibration procedure automatically performed by the software. The best fit of the observed data is shown in a graph.
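A possible implementation of this calibration step is sketched below, assuming the mixture written above; the function and parameter names (mixture, calibrate, mu_ln, sigma_ln, mu_n, sigma_n, omega) are illustrative and do not reproduce the internal GuEstNBL code.

```python
# Nonlinear least-squares calibration of the two-component model (sketch).
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import lognorm, norm

T = 0.5  # truncation factor of the influenced component

def mixture(x, mu_ln, sigma_ln, mu_n, sigma_n, omega, width):
    """Δx·[ω·f_NC(x) + (1-ω)·t·f_IC(x)] with a log-normal natural component
    and a normal influenced component."""
    f_nc = lognorm.pdf(x, s=sigma_ln, scale=np.exp(mu_ln))
    f_ic = norm.pdf(x, loc=mu_n, scale=sigma_n)
    return width * (omega * f_nc + (1.0 - omega) * T * f_ic)

def calibrate(centers, freq, width):
    """Fit the mixture to the observed frequencies (centers, freq and width
    come from the previous sketch) and return the calibrated parameters."""
    def model(x, mu_ln, sigma_ln, mu_n, sigma_n, omega):
        return mixture(x, mu_ln, sigma_ln, mu_n, sigma_n, omega, width)

    p0 = [np.log(np.median(centers)), 1.0, np.mean(centers), np.std(centers), 0.7]
    bounds = ([-np.inf, 1e-6, -np.inf, 1e-6, 0.0],
              [np.inf, np.inf, np.inf, np.inf, 1.0])
    popt, _ = curve_fit(model, centers, freq, p0=p0, bounds=bounds)
    return popt  # mu_ln, sigma_ln, mu_n, sigma_n, omega
```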
Parameter calibration is then followed by the automatic computation of the range of concentrations comprised between the 10th (NBL10) and the 90th (NBL90) percentile of the identified log-normal PDF (a graph shows the cumulative distribution function (CDF)); the NBL is then set equal to NBL90 [6,7,18], and its value is displayed in a box next to the CDF graph.
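Under the same assumptions as the previous sketch, the percentile extraction might be expressed as follows, using the parameters of the calibrated natural (log-normal) component.

```python
# NBL10/NBL90 from the calibrated natural component (illustrative only).
import numpy as np
from scipy.stats import lognorm

def nbl_range(mu_ln: float, sigma_ln: float):
    dist = lognorm(s=sigma_ln, scale=np.exp(mu_ln))
    nbl10 = dist.ppf(0.10)   # lower bound of the natural range
    nbl90 = dist.ppf(0.90)   # value adopted as the NBL
    return nbl10, nbl90
```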
NBL determination must take into account the spatial and temporal variability of the hydro-chemical characteristics of aquifers. The reliability of an estimated NBL depends on the coverage of the available data: a sample size suitable to describe the spatial variability of the system under examination should include a minimum of 15 adequately distributed monitoring points, while a sample size adequate to describe its temporal variability should include a minimum of 8 observations, regularly distributed over a period of 2 years, for at least 80% of the monitoring points. If the dataset meets both requirements, a high level of confidence is attributed to the determined NBL; otherwise, it is considered provisional pending further data collection [6,7,18].
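These two criteria lend themselves to a simple automatic check; the following sketch, assuming a table with the hypothetical "well_id" and "date" columns used in the earlier sketches, returns the adequacy flags and the resulting level of confidence.

```python
# Spatial/temporal coverage check (thresholds follow the text; input format assumed).
import pandas as pd

def coverage_flags(df: pd.DataFrame):
    """Return (spatial_ok, temporal_ok) according to the criteria above."""
    wells = df.groupby("well_id")
    spatial_ok = wells.ngroups >= 15

    def well_ok(g: pd.DataFrame) -> bool:
        span_years = (g["date"].max() - g["date"].min()).days / 365.25
        return len(g) >= 8 and span_years >= 2.0

    temporal_ok = wells.apply(well_ok).mean() >= 0.8   # at least 80% of the wells
    return spatial_ok, temporal_ok

def confidence_level(df: pd.DataFrame) -> str:
    spatial_ok, temporal_ok = coverage_flags(df)
    return "high" if (spatial_ok and temporal_ok) else "provisional"
```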
2.1.4. Pre-Selection Method
The pre-selection global statistical method, outlined in the BRIDGE project like the component separation method, is based on the assumption that the concentration of specific indicator substances, detected in the analyzed samples, is strictly related to anthropogenic influence. If the concentration of these species is higher than well-defined values, the samples involved are excluded from the NBL estimation procedure, as they are affected by human impact. The exclusion of data considered not representative of the natural system under investigation is based on the following criteria: the presence of concentrations of organic contaminants, or of other substances related to anthropogenic activity, greater than 75% of the threshold value provided by the current regulations; the presence of nitric or ammonia nitrogen concentrations exceeding 10.0 mg/L for nitrates (NO3−) and 0.1 mg/L for ammonium (NH4+) [7,18].
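A hedged sketch of this screening is given below; the column names ("NO3", "NH4") and the structure of the regulatory threshold table are assumptions made purely for illustration.

```python
# Pre-selection screening: drop samples exceeding the criteria listed above.
import pandas as pd

def preselect(samples: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Remove samples impacted by anthropogenic markers.

    `samples` is assumed to hold one row per sample with one column per measured
    species; `thresholds` maps anthropogenic species to their regulatory values."""
    keep = pd.Series(True, index=samples.index)
    # organic contaminants and other anthropogenic species: > 75% of the threshold
    for species, tv in thresholds.items():
        if species in samples.columns:
            keep &= ~(samples[species] > 0.75 * tv)
    # nitrogen species: fixed control values from the text
    if "NO3" in samples.columns:
        keep &= ~(samples["NO3"] > 10.0)   # mg/L
    if "NH4" in samples.columns:
        keep &= ~(samples["NH4"] > 0.1)    # mg/L
    return samples[keep]
```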
Based on the above, the software displays an interactive window in which the anthropogenic markers can be selected in order to exclude the samples with values higher than the established thresholds. The program is provided with a list containing the anthropogenic species together with their control values, and it automatically displays those included in the dataset. When a marker is selected and confirmed, the samples not meeting the above requirements are removed from the dataset [7,18].
In this step, the temporal analysis of the data is carried out to identify potential outliers. Outliers are concentration values that differ strongly from the others within the time series of each monitoring well. The challenge is to recognize their nature according to the phenomenon under investigation. Concentration spikes could be caused by anthropic contamination or by the strong presence of a specific element in the minerals of a portion of the aquifer solid matrix. In the first case, the high values are certainly outside the scope of the study, while in the second case they could be considered representative of the natural background. Therefore, the removal of potential outliers must be carefully evaluated. These extreme concentration values, which frequently occur in environmental data, are highlighted by the software through graphical methods or through specific statistical tests. In particular, when the outlier analysis is run, the program allows the operator to evaluate, simultaneously or well by well, the results of the most used graphical methods (normal quantile–quantile (Q-Q) plots and box plots) and those of the best-known statistical methods, such as the Discordance [24], Huber [25], Walsh [26], Dixon [27], and Rosner [28] tests. Graphical and statistical results must be evaluated in the light of the outcomes of the conceptual model, so that the operator can decide whether to keep or discard each detected outlier, providing a justification in the latter case.
The concentration time series collected in each monitoring well are analyzed, at this point, looking for the occurrence of a trend in the data. The software estimates the slope of a possible trend line through the implementation of the Mann–Kendall test [29,30]. Depending on the results obtained, the following actions are proposed: if the analysis does not show, for the single monitoring well, a significant trend in the observed time window, the fluctuation in the data can be attributed to seasonal variations under natural conditions; if the analysis shows a significant trend in the time series of a specific monitoring well, the investigated element is suspected of being subject to non-natural control factors. In the first case, the monitoring well is suitable to move forward in the guided procedure. In the second case, an assessment of whether to exclude the monitoring well from the NBL estimation must be made, also considering the indications coming from the conceptual model. The trend analysis window displays the results in the form of graphs, showing the chronological distribution of the observed data in each monitoring well and, where present, the trend line. A table summarizes the results and allows the operator, via checkboxes, to decide whether to exclude the wells showing a trend in their data.
As described for block CS1, relating to the component separation method, the software calculates the median values from the concentration time series that have passed the previous steps, assigns them to the respective monitoring wells, and displays them in a table.
The same approaches and methods adopted in Block PS2 are repeated at this stage to identify and manage the potential outliers that may be found among the median values.
This fundamental step is presented at this point of the procedure, but it is performed by the software even before the identification of both temporal and spatial outliers. The reason why this analysis recurs more than once is that the applicability of certain methodologies, such as those previously described for the outlier identification and those coming afterwards for the NBL estimation, depends on the probability distribution that best approximates the available observed data. Moreover, given that the statistical sample can lose data, precisely because of the potential exclusion of outliers, it is essential to be able to repeat the analysis in this eventuality. This operation is carried out by applying appropriate tests, such as normal quantile–quantile (Q-Q) plots and the Shapiro–Wilk [31], D’Agostino [32], and Lilliefors [33] tests, all of which are included in and automatically performed by the software.
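For illustration, the three cited tests can be run in Python as sketched below (Shapiro–Wilk and D’Agostino are available in SciPy, Lilliefors in statsmodels); this is only an example of the screening, not the GuEstNBL code.

```python
# Normality screening with the tests cited in the text (illustrative only).
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

def normality_report(values, alpha: float = 0.05) -> dict:
    """True means that normality is not rejected at the chosen alpha."""
    x = np.asarray(values, dtype=float)
    report = {}
    report["shapiro_wilk"] = stats.shapiro(x).pvalue > alpha
    report["dagostino"] = stats.normaltest(x).pvalue > alpha   # needs n >= 8
    _, p_lillie = lilliefors(x, dist="norm")
    report["lilliefors"] = p_lillie > alpha
    return report
```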
This evaluation, which is also performed in the component separation method and has already been described in Blocks CS5-CS7, distinguishes different levels of spatial and temporal “coverage” of the available data and guides the NBL estimation process. There are four distinct cases (A, B, C, and D), which are described below; they are automatically identified by the software and proposed to the operator according to the spatial and temporal coverage of the data remaining after the previous steps.
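The four cases can be seen as the combination of the two adequacy checks sketched earlier for the component separation method; a minimal illustration, reusing the hypothetical coverage_flags function from that sketch, is the following.

```python
# Mapping of the two adequacy flags onto the four dataset cases (sketch).
def coverage_case(spatial_ok: bool, temporal_ok: bool) -> str:
    if spatial_ok and temporal_ok:
        return "A"   # adequate spatial and temporal coverage
    if spatial_ok:
        return "B"   # adequate spatial coverage only
    if temporal_ok:
        return "C"   # adequate temporal coverage only
    return "D"       # neither dimension is adequate
```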
These datasets exhibit an adequate spatial dimension. Case A, unlike case B, also shows an adequate temporal coverage. There is no substantial difference in the NBL estimation between these two types of dataset; the only distinction is the higher level of confidence attributed to the NBLs determined for a dataset of case A. The software assigns to the NBL the value of the maximum observed median, provided that the dataset is normally distributed. If the dataset shows a non-normal distribution, the NBL is given by the 95th percentile of the identified PDF. In particular, the software automatically determines whether the observed medians are best approximated by a log-normal or a gamma distribution, whether it is preferable to normalize the data, or, finally, whether a non-parametric treatment is the right solution. The parameters of the recognized probability density function are then estimated to fit the observed data within the same automatic calibration procedure described in Block CS3, and the NBL is calculated following the implementations defined in Block CS4. In the case of a distribution suitable for normalization, the data are transformed through the Box-Cox transformation [34] and then fitted by a normal PDF, moving forward with the same process described above. Non-parametric datasets (sets of data that are not satisfactorily approximated by any distribution) are processed through a graphical method whereby the software draws the cumulative frequency curve of the data and identifies, as representative of the NBL, the value corresponding to the 90th percentile. Finally, the software can also evaluate the NBL through well-known parameters such as the upper tolerance limit (UTL) and the upper prediction limit (UPL) [35].
This type of dataset shows an adequate temporal dimension but a poor spatial coverage. In this case, the procedures described for datasets A and B, used there to treat the medians of all monitoring wells, are applied to the data of each single observation point, and an NBL value is thus estimated for each of them. The final NBL representing the entire dataset is given by the maximum among the estimated values.
When the dataset does not have an adequate dimension in either time or space, further data and information must be collected; in the meantime, a provisional NBL can be estimated if a total number of observations ≥10 is available. In this case, the NBL is equal to the 90th percentile of the whole dataset.
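For completeness, the handling of cases C and D can be sketched as follows, reusing the hypothetical nbl_case_ab function from the previous sketch for the per-well estimates of case C.

```python
# Handling of cases C and D as described in the text (illustrative only).
from typing import Dict, Iterable, Optional
import numpy as np

def nbl_case_c(per_well_series: Dict[str, Iterable[float]]) -> float:
    """Case C: apply the case A/B estimator to each well and take the maximum."""
    return max(nbl_case_ab(values) for values in per_well_series.values())

def nbl_case_d(all_values: Iterable[float]) -> Optional[float]:
    """Case D: provisional NBL = 90th percentile, only if at least 10 observations."""
    x = np.asarray(list(all_values), dtype=float)
    return float(np.percentile(x, 90)) if x.size >= 10 else None
```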