Determining the Relevant Scale to Analyze the Quality of Regional Groundwater Resources while Combining Groundwater Bodies, Physicochemical and Biological Databases in Southeastern France

: In France, the data resulting from monitoring water intended for human consumption are integrated into a national database called SISE-Eaux, a useful and relevant tool for studying the quality of raw and distributed water. A previous study carried out on all the data from the Provence-Alpes-Côte d’Azur (PACA) region in south-eastern France (1061 sampling points, 5295 analyses and 15 parameters) revealed that the dilution of the information in a heterogeneous environment constitutes an obstacle to the analysis of ongoing processes that are sources of variability. In this article, cross-referencing this information with the compartmentalization into groundwater bodies (MESO) provides a hydrogeological constraint on the dataset that can help to better define more homogeneous subsets and improve the interpretation. The approach involves three steps: (1) A principal component analysis conducted on the whole dataset aimed at eliminating information redundancy; (2) an unsupervised grouping of groundwater bodies having similar sources of variability; (3) a principal component analysis carried out within the main groups and sub-groups identified, aiming to define and prioritize the sources of variability and the associated processes. The results supported by discriminant analysis and machine learning show that the grouping of MESO is the best-suited scale to study ongoing processes due to greater homogeneity. One of the eight main groups identified in PACA, corresponding to the accompanying aquifers of the main rivers, is analyzed by way of illustration. Water–rock interactions, redox processes and their effects on the release of metals, arsenic and fecal contamination along different pathways were specifically identified with varying impacts according to the subgroups. We discussed both the significance of the principal components and the mean values of the bacteriological parameters, which provide information on the causes and on the state of contamination, respectively. Based on the results from two different groups of MESO, some guidelines in terms of a strategy for resource quality monitoring are proposed.


Introduction
Since groundwater aquifers are less vulnerable to pollution than surface waters, they are prioritized for supplying drinking water networks. These aquifers can provide information relating to interactions between waters and hosting lithologies that host them. In addition, waters percolating through the surface layers also record information related to land use and can provide information on an aquifer vulnerability to contamination, generally of anthropogenic origin [1,2]. Protecting this resource, which is globally under severe pressure [3,4], involves assessing the most vulnerable areas and compatible land use, and understanding the mechanisms responsible for water characteristics that vary in space and time [5]. Various scales have been recommended to tackle this problem, including a groundwater catchment station, identifying sensitive points (sentinel or monitoring wells), a watershed or groundwater body area, usually presenting low geological complexity, or even a region. The approach also varies, including geochemical and multivariate statistical analysis, endmember mixing, diverse geostatistical procedures for analyzing geochemical structures, and modeling at various scales [6][7][8][9][10][11][12][13][14][15][16].
The physicochemical and biological characteristics of water are widely used as natural markers or tracers in the study of water transfers. This approach relies on the stability of the signatures over time, but also on the spatiality within the water bodies. However, the mechanisms responsible for water quality vary according to lithology, habitation of land by humans, and anthropogenic impact. The enlargement of the areas studied by upscaling then becomes problematic because, for regional scales, the lithology can become heterogeneous, which makes the use of natural markers or tracers problematic. Thus, the management of the heterogeneity of geological characteristics during local to regional upscaling is of scientific interest in resource management [15]. For groundwater bodies, one of the keys to successfully shifting from the local to the regional scale could involve the automatic determination of units exhibiting a certain homogeneity of water-rock interaction processes or more generally of the processes involved in water diversity.
France has abundant surface and groundwater resources [17], but high water demand and widespread emission of pollutants, together with the predicted impacts of climate change, threaten the long-term sustainability of groundwater reserves [18]. Groundwater quality is monitored by the Regional Health Agency (ARS for Agence Régionale de la Santé) within the framework of health surveillance. This monitoring led to creating a spatialized database, named SISE-Eaux containing data on various aspects, i.e., physicochemistry, composition in major ions, microbiological parameters of fecal or non-fecal origin, heavy metals, etc. At the national level, the database includes more than 32,000 catchment stations consisting of 96% groundwater and 4% surface water. The data collected are mainly used for drawing maps of the various parameters and for monitoring quality over time. However, this potentially rich source of information should improve our knowledge of the diversity of the mechanisms (lithological, hydrogeological, pedological) responsible for the acquisition of the groundwater physicochemical characteristics.
Driven by the European Water Framework Directive (WFD) in 2000 [19,20], the Water Agencies with the support of the French Geological Survey (BRGM) inventoried underground water reservoirs on a national scale, which were then subdivided into groundwater bodies (GWB units of the Water Information System for Europe: WISE) referred to hereafter as MESO (for Masse d'Eau SOuterraines in French). MESO corresponds to a distinct volume of groundwater in one or more aquifers, an aquifer representing one or more underground layers of sufficient porosity and permeability to allow either a significant groundwater flow or the capture of large quantities of groundwater. Each MESO is described in a geological survey, and its geographic contours are marked on a GIS (Figure 1). At the national level, MESO are referred to by a unique code ranging from FRXG000 to FRXG999, where FR refers to France, X denotes the major watershed (for example D for the Rhone Mediterranean Watershed) and G refers to groundwater resource. Our research team is currently summarizing and reviewing the water quality intended for human consumption in France. Preliminary research on the Provence-Alpes-Côte d'Azur region, which has a dry and changing climate [18], highlighted the main mechanisms responsible for groundwater chemical and bacteriological characteristics but was hampered by the great diversity of environments [21]. The processing of all the data indeed combines very different environments and mixes different sources of variability specific to each landscape unit or each groundwater body, inducing weak correlations among the parameters. Consequently, the main mechanisms active in any of these units are diluted and hence distort the mass of regional information. This makes them difficult to study by statistical processing, since they only represent a tiny portion of the total variability in the general database. This highlights the need to work on the scale of more homogeneous subsets for a better characterization of the processes involved, and thus facilitating the more effective application of this knowledge to groundwater resource management. In this framework, which objective method, that is to say, unsupervised and using no prerequisites, would the database allow to be divided into coherent subsets in terms of functioning and physicochemical or bacteriological vulnerability? How do meaningful functional units create based on the information contained in these two databases? Is it possible to combine this information on groundwater bodies with a statistical approach? Is the MESO scale relevant for defining a strategy for protecting or monitoring water quality?
The objective of this work is to clarify the diversity of the processes affecting water quality, to study the sources of composition variability and to provide a methodological tool adapted to best manage the resource. This effort is based on combining information conveyed by the ARS SISE-Eaux database and the compartmentalization into groundwater bodies proposed by the BRGM. The multiparameter approach combines different classical statistical mathematical methods aiming to reduce the size of the hyperspace of data, to group together environments with common characteristics to obtain homogeneous and coherent ensembles, to analyze the processes responsible for water quality within each environment, and to specify the type of groundwater vulnerability.

Study Site, the Provence-Alpes-Côte d'Azur Region
The PACA region, which covers an area of 31,400 km 2 in the far southeastern region of France, includes 6 administrative departments, namely Alpes-de-Haute-Provence, Hautes-Alpes, Alpes-Maritimes, Bouches-du-Rhône, Var, and Vaucluse. The main characteristics of this region are reviewed in Tiouiouine et al. [21], which addresses the significant altitudinal gradient (0 to 4000 m), the geological and lithological diversity and its distribution, the main types of aquifers and their flow type, and the hydrographic network dominated by the Rhône River in the West, the Durance basin crossing the PACA region and some coastal rivers (Argens, Var, etc.).

The Spatial Reference System for Groundwater Bodies
The Water Agency and BRGM have defined 80 groundwater bodies (MESO) distributed throughout the PACA region. This large unwieldy number of MESO makes it impractical to (1) study mechanisms presenting variations in water quality, and (2) propose a relevant method of analysis that is easily applicable to other environments. However, this compartmentalization must serve as a basis for identifying homogeneous sets, sorting data and their statistical processing. All PACA groundwater bodies are within the Rhone Mediterranean watershed. Most of them are located in the Durance watershed draining the southern Alps, a major tributary to the Rhone River.

The SISE-Eaux Database
Water samples were collected and analyzed by ARS-PACA laboratories, approved by the Ministry of Health, fulfilling all international certifications and requirements (ISO/IEC 17025) for analytical quality. The dataset used in this work was extracted in two stages from the ARS-PACA database, which is the regional subset of the national SISE-Eaux database. Three different types of analyses were identified, i.e., complete analyses grouping many parameters (up to 100), standard analyses comprising only a few parameters (about 30) and control analyses generally focused on a limited number of parameters (about 10). A first extraction concerned raw water analysis, i.e., before any treatment. More than 10 thousand groundwater samples with up to 30 microbiological and/or physicochemical parameters were obtained. The data are spread over a 10-year period between 10 January 2006 and 15 September 2016. As a result, the matrix obtained from this first extraction was a sparse matrix. The second step was aimed at eliminating empty cells and resulted in 5295 analyses and 15 parameters (79,425 values), namely the biological parameters Enterococci (Enteroc.) and Escherichia coli (E. Coli), electrical conductivity (EC), major ions referred to as HCO3, Cl, SO4, NO3, H + , Ca, Mg, K, Na, metals Fe, Mn, and metalloid As. The pH was transformed into H + ion concentration to avoid mixing parameters with linear or logarithmic scale and thus to provide a homogeneous dataset. All 1061 sampling points ( Figure 1) were georeferenced [21] and then linked to a single MESO. Some aquifers were superimposed, and it was possible to link samples with these groundwater bodies based on the depth of the production wells. For health monitoring, the water resources supplying the main cities were collected at the same point on different dates, sometimes providing up to tens of samples, allowing good repeatability, and temporal representativeness. Monitoring water quality over time is, however, not addressed in this work.

Statistical Treatments
Statistical processing had several objectives. At the scale of the PACA region, the multiplicity of environments and land uses requires an enormous amount of information, in terms of both the number of water analyses and parameters to deliver a detailed picture of the different environments, of their chemical signature, and resource vulnerability. The multiplication of parameters then comes up against "the curse of large dimensions" [22], rendering it difficult to analyze and organize data in high dimension space. The first objective was therefore to reduce the size of the data hyperspace [23]. A principal component analysis (PCA) was carried out by diagonalization of the correlation matrix to transform 15 original variables into 15 orthogonal principal components (PCs) that are a linear combination of the original variables. The treatment reduced the volume of redundant information, and took into account major PCs, i.e., those explaining the highest proportions of the total variance, it was possible to reduce the dimension while minimizing the loss of information, which was attributed to background statistical noise.
The management of protection measures is carried out at the departmental and regional level, that is to say at the level of administrative units of approximately 5000 to 30,000 km 2 , including different natural environments. Such a division of the space is not the most suitable to study groundwater characteristics. On the other hand, MESO are too numerous to correspond to a unit of study easily transposable to other regions in the future. An unsupervised and coherent grouping of MESO, i.e., with similar characteristics, was attempted. Agglomerative hierarchical clustering (AHC) is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It is also known as agglomerative nesting (AGNES). The algorithm treats each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters were combined into one big cluster containing all objects. The result was a tree-based representation of the objects, named dendrogram. An unsupervised AHC was performed using mean values of each groundwater body (MESO) on the principal components. Relative similarities among MESO were quantified using Euclidean distance, and the levels of similarity at which MESO were merged were used to construct the dendrogram.
A discriminant analysis was carried out to determine whether it was possible to differentiate the groundwater bodies using their water chemical and bacteriological composition. This analysis imposes a non-zero variance constraint within MESO. Therefore, MESO with only one water analysis were discarded. A similar discriminant analysis was then performed on the groups of MESO, and the results of both analyses on MESO and groups of MESO were compared.
The discriminant analysis has two drawbacks. The first is the linear nature of the approach, which does not necessarily correspond to the nature of the relationships among the studied parameters. The second is that the effectiveness of the discrimination is evaluated from the same dataset used to find the discriminant function, and not on a dataset independent of the results of the analysis. To overcome these drawbacks, similar discrimination was then performed using Machine Learning methods to predict which MESO group the water samples belonged to. Among the many Machine Learning procedures, the Naive Bayesian Classification was chosen because it requires little training data compared to other procedures. The waters of each group of MESO were classified by increasing EC and one in two samples was chosen to build a training dataset, with the rest of the data used to verify classification performance. In this way, both the training and validation subset, of 2550 samples each, covered the entire concentration range.
A principal component analysis was carried out at the PACA regional scale and recently presented [21] to identify, quantify, and rank the different sources of variability of water quality and explore the underlying processes responsible for this variability. Here, a similar principal component analysis was conducted within each group of MESO and compared to the results obtained at the regional scale. The orthogonal (uncorrelated) principal components ensure the independence of the associated processes. All calculations were performed using the XLSTAT software (Addinsoft).

Grouping Groundwater Bodies
Observations were made in the entire study area, although heterogeneously ( Figure 1). The lowest densities of observations corresponded to sectors characterized by water scarcity or resources deep below the surface, for example in karst areas [24], or in presence of saline water of poor quality for drinking water production, like the Camargue plain in the Rhone delta [25,26]. These sectors only represented small areas. The results of the PCA conducted on the entire dataset are presented in Figure 2 and Table 1.   The first 11 principal components, totaling almost 97% of the information, were retained for MESO agglomerative clustering (AHC). The remaining 3% was considered as statistical background noise. The mean value and standard deviation of each of the 70 MESO on these first 11 factorial axes were calculated and presented in Figure 3 for the first score plot, highlighting that the variability of water characteristics within a MESO is usually higher than among the MESO. The dendrogram obtained by the unsupervised AHC is presented in Figure 4. This processing revealed large sets of MESO, which can be grouped into eight main groups, whose distribution in the first score plot is presented in Figure 5. Seven groups can be divided into several subgroups, totaling 25 sets. The results are summarized in Table 2. Note that group 8 contains three MESO without any similarities with the others, or even between them. The identification of the MESO of each group linked to their FRDG code is available at http://www.sandre.eaufrance.fr. The distribution of the eight main groups in the PACA region is presented in Figure 6. The clustering shows that dissimilarities among the MESO are between 0.01 and 0.4, while the threshold value set to distinguish the different groups is higher than 10. These results show the similarity of the characteristics of the MESO within the same group (with the exception of the above-mentioned group 8) as well as the difference in characteristics among groups, all parameters taken together. Within groups, the dissimilarities between sub-groups are between 0.5 and 2.

Discriminant Analysis and Machine Learning
The results of the discriminant analysis carried out on the MESO are summarized in Table 3. The rate of well-classified samples was very low, only 1.2%. Some MESO had higher rates (41% to 67% of well classified), but these are those with a low number of samples, from 3 to 20. Grouping the MESO using cluster analysis brings together MESO with similar chemical and bacteriological characteristics, and differing from other MESO. The result of the discriminant analysis differed greatly from the previous case, with a well-classified rate of 28.1%, a value that remained low but significant. Groups 3 and 4 differed from other groups with well-classified rates of 84% and 77%, respectively, but they were mainly confused with each other (Table 4). Group 1, the most abundant in terms of the number of water samples, was the least well discriminated against, probably because of (1) its large size (Table  2), (2) the number of MESO it contains, and (3) their variability in composition. Following the learning phase, the prediction of belonging to a MESO based on the chemical composition only led to one well-classified water analysis out of 2550 (Table 3). The success rate was therefore almost zero (0.04%). The self-check on the learning dataset was successful for 28 out of 2550 water samples (1.1% well classified). The prediction of belonging to a group of MESO gave 766 wellclassified samples after the learning phase, i.e., a success rate of 30%, slightly higher but of the same rough estimate than that obtained with the discriminant analysis. The self-check of the learning effectiveness on the same training subset gave 782 well-classified samples, or 30.7%, a result substantially similar to that obtained in the validation subset. Figure 7 illustrates the distribution of cumulative variance carried by the principal components, comparing the PCAs performed on the full dataset [21], then on the groups and subgroups of MESO. Only groups 1 to 4, which had a large number of subgroups, are presented. Generally speaking, by considering the sub-groups for the analysis, the curve shifted upwards, that is to say, the variance carried by the first PCs increased. In other words, the chemical and bacteriological composition of the subgroups was more homogeneous, the weight of the mechanisms associated with each PC was stronger, and their interpretation should be easier.

Figure 7.
Distribution of the cumulative variance on the first 11 principal components for PCA performed on the regional dataset (PACA), groups 1 to 4 and subgroups.

Distribution and PCA in Group 1 and Sub-Groups
A detailed analysis of each group is within the aim of this article, and only group 1 was here detailed. This group of MESO was chosen because it included subgroups with many samples, and was therefore likely to exhibit strong heterogeneity. Figure 7 also highlights that the gain in homogeneity remains moderate within the subgroups. Group 1 mainly included the accompanying aquifers of the lower Rhône valley, of the Durance valley down to its confluence with the Rhone and Sorgues rivers, which are also an affluent of the Rhone. The distribution of the subgroups is detailed in Figure 6a. They were organized in a coherent spatial manner presented from upstream to downstream. Subgroup 1C comprised aquifers of alluvial valleys in the upstream sector of the Durance (Asse Valley and upper Durance) as well as alluvial aquifers of the littoral valleys of the Alpes-Maritimes department (Siagne, Loup and Paillon rivers). Subgroup 1D presented areas with hydrogeological characteristics close to the previous subgroup; this includes the alluvial deposits of the lower Verdon River together with surrounding limestone plateaus, part of the Argens watershed and the Plains of Comtat, occupied by the Sorgues river. Subgroup 1B comprised low gradient aquifers around the Rhône-Durance confluence. They included the Tertiary sandstone and marlylimestone formations from the Lower Durance basin, the Miocene Molasse from the Comtat, the limestone massifs northwest of the Bouches du Rhône department, the formations of the Touloubre watershed and Berre Lake. Finally, subgroup 1A is represented by the alluvial aquifers of the sector downstream from the Rhône-Durance network and comprises the former Durance Delta (the Crau), the middle and lower Durance alluvial deposits, limestones and marls from the Drôme, Roubion and Jabron watersheds, blue clays from the Rhône valley, and marl-limestone deposits and sandstones from hills on the left bank of the "Côte du Rhône" and on the edge of the Comtat basin. Figure 8 summarizes the results of the PCA carried out on group 1, then on each sub-set, and compared with the results obtained from the entire regional dataset [21]. In all cases, the water dissolved load was the main factor in the variation, with a first axis always closely associated with EC. The specifics of each subgroup only appeared on the second factorial axis. For the whole PACA dataset, the bacteriological contamination was carried by the PCs 2 and 3, representing respectively the contamination in carbonate (karst and other formations) and non-carbonate areas. The coordinates of the bacteriological parameters were moderate to low (≈0.6). Considering group 1 as a whole, the bacteriological contamination was concentrated on PC2, with a strong score of the bacteriological parameters. The trend was the same for subgroups 1A and 1C. For group 1B, PC2 was essentially linked to three variables, Fe, Mn and As, strongly correlated and close to 1. Finally, for group 1D, the second factorial axis carried a contrast in the chemical profile between carbonate (positively scored) and chloride waters. The coordinates of the variables (HCO3 and Cl) on this PC2 were moderate. This contrast of chemical profile was found to be carried by the 3rd PC for group 1 as well as for the subgroups 1A and 1B. For group 1C, PC3 also represented an opposite chemical profile, but which concerned SO4 and Cl, carbonate waters being in the intermediate position. PC4 carried secondary processes responsible only for 7% to 10% of the variance, depending on the set or subset. At the regional scale, the 4th PC (7.3% of the variance) identified Fe and Mn-laden waters with lower pH. In the case of group 1, the metals were correlated with carbonate waters poor in nitrates, while Fe and Mn appeared negatively correlated in group 1C. Finally, for 1B and 1D, PC4 carried the information related to the bacterial contamination. Figure 8. Factorial plans PC1-PC2 and PC3-PC4 for PCA performed on regional dataset, group 1 and its subgroups. Table 5 shows the mean values of the parameters within groups 1 and 8. In particular, the values of bacteriological parameters and metals are much higher for group 8, reflecting a high vulnerability to this type of contamination. PCA performed on group 8 highlighted that the first four principal components involve fecal contamination parameters (not shown) suggesting four independent processes.

Groups of Groundwater Bodies
From these results, what relevant scale can we recommend for studying mechanisms responsible for the diversity of waters? Recent studies have shown that addressing the diversity of groundwater at the scale of the PACA region as a whole, that is to say from the whole dataset, is limited [21]. The grouping of samples by groundwater body is possible based on their location and the depth of the collection. However, while MESO are relevant hydrogeological units, carefully identified, described and mapped, the spatial and temporal variability in the water composition within each MESO is important in view of the differences in composition between the MESO (Figure 3). This is reflected by the impossibility of assigning, from its chemical and biological composition, a water sample to the MESO from which it was taken, as revealed by the discriminant analysis and the machine learning procedure. Consequently, by individually studying the MESO, the data processing of water quality by conventional statistical tools remains confused and leads to many redundancies. Neglecting only 3% of the total information contained in the original dataset, the results obtained by agglomerative clustering justify the choice of the spatial scale according to the MESO groups, or sub-groups, for the management of surveillance and protection. The grouping of MESO with similar characteristics ( Figure 5) furnishes coherent, more relevant and effective sets for the analysis of the occurring processes, responsible for groundwater composition. It leads to the individualization of eight large groups of MESO, whose distribution in the area of the PACA region ( Figure 6) has strong similarities with the map of regional aquifers developed by the BRGM [21]. Therefore, despite the temporal variability of the characteristics at each sampling point and despite the internal spatial variability within each groundwater body, the results are consistent with the current knowledge of aquifers established from solid hydrogeological information, meaning that the compartmentalization of the information obtained makes sense. In this grouping, the importance and significance of the chemical profile, which depends on water-rock interactions and marine influence, varies from group to group. Conversely, some large areas correspond to a heterogeneous grouping of geological units such as group 7, which brings together, on the one hand, limestone massifs in the middle mountains subject to a Mediterranean climate (mainly 7A), and on the other hand, moraines or black marls subjected to a more mountainous climate (mainly 7B). The classification into these eight main groups is therefore not only based on geological contrasts and their consequences on water chemistry. Each sub-group has its own hierarchy of mechanisms responsible for the diversity in the characteristics of the waters, and generally presents a gain in homogeneity and therefore consistency, as highlighted by the analysis of the cumulated variance (Figure 7). The heterogeneity of the environment and the diversity of the mechanisms involved sharply decrease by successively considering the regional water dataset, the groups and especially the subgroups of MESO.
Generally speaking, it is the hierarchical clustering which, depending on the case, provides information on the relevant scale of grouping, which may vary according to the study area. Such a grouping into more homogeneous subsets facilitates the study, as illustrated here by the analysis of group 1. This group is made up of the accompanying aquifers of the Rhône, the Durance and their tributaries. We mentioned consistency in the distribution of the eight main groups of MESO, but this is also the case for the subgroups regarding their geographic location, the flow dynamics of groundwater, and land use. Subgroup 1A corresponds to low areas characterized by a low circulation of the aquifers. Subgroup 1B is located further upstream, and corresponds to zones with moderate transit, locally slowed down by natural rock bars. Even further upstream, subgroup 1C corresponds to the alluvial valleys within the cretaceous black marl sector, together with part of the Argens basin [21]. Finally, 1D brings together the Verdon and the other part of the Argens valley, which are impacted by the construction of several dams, and the Sorgues plain, characterized by seasonal oscillations of the water table that may reach up to the surface. Subgroup 1E only has 13 water samples and does not ensure a statistically meaningful number. It was then not discussed here.

Dissolved Load and Chemical Profile
For the regional dataset, main groups or subgroups, the first factor of variability was the dissolved load (Table 1), with a significant weight of the parameter EC (0.983 and 0.982 for the regional dataset and group 1, respectively). Highlighting the water mineral concentration as a main source of variability in a dataset is a classic finding in hydrochemical studies [27,28]. Several processes occurring in the PACA region have been described that can explain this first axis of variability. They are mainly the water-rock interactions, which depend on the lithology and hydrogeological circuits, climate including both seasonal variability [29] and conditions between the plain and colder mountain areas with lower weathering rate, and marine intrusions in coastal environments, themselves reinforced by anthropic freshwater pumping [21]. In addition, there is a strong local seasonal effect with increasing dissolved CO2 during summer due to the growth of vegetation, and which increases the weathering rate in karst and carbonate areas [29,30]. Within the subgroups of group 1, which only includes the accompanying aquifers of the rivers, the marine influence and the alteration of carbonates are to be excluded, and both water-rock interactions and seasonal effect must be retained as a major source of variability, often with shallow aquifers subject to strong evaporation during hot and dry summers.
In addition to the dissolved load, the analysis highlights contrasts in the water chemical profile (Figure 8). PC2 for 1B and PC3 for 1A oppose carbonate to chloride waters, with slightly higher coordinates of these parameters than for the analysis in group 1 as a whole, reflecting better discrimination of this chemical profile contrast. For subgroup 1C, PC3 is mainly driven by SO4. This subgroup mainly concerns low and coastal areas with little karst or carbonate zones. The MESO of this group are mainly influenced by former marine deposits presenting, on the one hand, sulfate evaporites, particularly in the Argens basin, and on the other hand, some pyrite and gypsum lenses in the sector of black soil, from the upper Durance basin. In the latter, toxicity problems in ruminant animals have been reported by the regional health agency (unpublished internal report), sulfates being reduced to toxic sulfides in the rumen of these animals [31].

Bacteriological Contamination
On the scale of the PACA region, fecal contamination is carried by two PCs (Figure 8), which implies that it is associated with two independent processes due to the orthogonality of the factorial axes. Thus, bacteriological contamination cannot be summed up in the context of "rain > runoff > solid load > transport of bacteria", but seems to result from more diverse and complex paths. In group 1, bacterial contamination is a source of variability uneven depending on the subgroup. For subgroups 1A and 1C, it is an important process, as reflected in PC2, essentially driven by bacteriological parameters and representing 11% and 16.1% of the variance, respectively. These two situations are, however, very different. For 1A (607 samples), PC2 is almost exclusively driven by bacteriological parameters with an eigenvalue of 1.92, suggesting that the contamination is a strong source of variability regardless of the water chemical profile, i.e., whatever the geological context. Aquifers belonging to subgroup 1A are shallow, which makes them particularly vulnerable, and associated with high population density, either in the form of urban agglomerations or dispersed rural communities. On the other hand, for 1C (196 samples), the bacteriological contamination must be attributed to the concentration of the population in the alluvial valleys and weak filtration by the materials, essentially consisting of rocks, pebbles, and gravels with strong porosity. The occurrence of karst areas and carbonate materials along the slopes explains why bacteriological parameters are positively correlated with HCO3. Fecal contamination is carried only in PC4 for the sub-groups 1B (101 samples) and 1D (398 samples), with lower coordinates (in the range 0.65-0.76). It is therefore a secondary source of variability for these two latter subgroups, which are more impacted by redox processes and metal contents.

Metals, Arsenic, and Redox Processes
For subgroup 1B, PC2 shows a correlation between the variables Fe, Mn, and As, responsible for approximately 20% of the variance (Figure 8). Tiouiouine et al. [21] have proposed several factors to explain the local presence of anomalous arsenic contents. The first process is sulfides solubilisation (As-containing pyrites), which can affect MESO located at the edge of crystalline basement areas. However, since group 1 only concerns the accompanying aquifers of rivers, this assumption cannot be retained. The second process is the presence of a geochemical background noise of arsenic linked to viticultural practices in the past, in particular the treatment of vineyards with mixtures of Ascontaining pesticides [32,33]. Under non-alkaline pH conditions, arsenic is fixed on metal oxides. In low-redox conditions, in particular near rock bars that slow the flows of the accompanying aquifers, Mn is solubilized, immediately followed by a dissolution of the ferric hydroxides, releasing Fe together with the adsorbed arsenic, which may explain the strong correlation between these three parameters.
Redox processes are also very active in the Verdon and Argens valleys. These valleys are partially submerged upstream from numerous hydroelectric dams, namely the lakes of Caramy, Esparon, Quinson, Artignosc, Sainte-Croix and Castillon. In these sectors, the aquifers develop in old organic levels, where low redox conditions favors the release of high Fe and Mn contents [34]. Consequently, some production wells had to be abandoned because of high metal contents, such as the Montmeyan production well near Lake Quinson. Similar processes can be observed in the Sorgues plain, where irrigation seasonally favors the rise of groundwater up to organic soil horizons.

Focus on Bacteriological Contamination Levels and Surveillance Strategy: Group 1 Versus Group 8
The treatment by groups of MESO makes it possible to identify several contexts and sources of problems which, combined with the intensity of the processes, make it possible to develop a surveillance and monitoring policy that goes beyond the scope of this article. We will detail just one example regarding the vulnerability of aquifers to fecal contamination, a fundamental concern for the protection of the PACA region, as it is responsible for most of the non-compliances of the production wells [34]. Table 5 underlines that, although bacterial contamination is one of the main sources of variability for group 1, it remains of low intensity, when compared with aquifers of group 8 associated with the coastal torrents of the Côte d'Azur. In group 1, the accompanying aquifers located upstream, receiving lateral inflows from steep karstic zones, with suspended matter, supports of bacterial transport, rich in ferric hydroxides, present a worse bacteriological quality. The monitoring strategy must therefore consist of relatively standard monitoring and rather focus on subgroup 1D. On the other hand, aquifers of group 8 present a high vulnerability to bacteriological contamination, and the PCA indicated that four processes are responsible for this situation. The first one, by far the most important, is positively correlated with nitrates and calcium-magnesium carbonate chemical profile. It typically concerns surface inputs in the karstic area linked to rainfall events. The monitoring strategy should therefore be reinforced during periods of stormy events, generally at the end of the summer. Fortunately, these waterbodies only represent a few pumping points of water for human consumption.

Limits of the Method and Difficulties Encountered
This method obviously requires the existence of solid databases on the quality of groundwater, as well as a strong spatial constraint provided by the compartmentalization into groundwater bodies. As this information was supported by the European Water Framework Directive, it should be available in the majority of the European Community countries [19,[35][36][37][38]. In our case, after studying areas ranging from 3000 to 10,000 km 2 , i.e., two or three departments according to the French administrative division, it appeared that studying the sources of spatial variability over larger areas (for example the PACA region of around 30,000 km 2 ) requires the development of an appropriate methodology. We opted for a completely objective approach, that is to say without the use of unsupervised methods, a choice dictated by the desire to be able to transpose the methodology to other regions. The applicability of the method obviously depends on the minimum amount of information on each MESO, which makes it possible to assess its spatial and temporal variability in a relevant way. In this work, one of the difficulties encountered is an unequal spatial distribution of the data, the mountain areas presenting many sampling points each having few samples, while in the plains a lower density of large boreholes supply a large population and are monitored with higher frequency. The goal was to find the homogeneous areas in terms of vulnerability, sources and intensity of contamination (especially fecal), so as to be able to propose a protection and surveillance strategy specifically adapted to the surface studied.

Conclusions
Surveying water characteristics is a relevant tool to assess the quality of groundwater intended for human consumption and monitor its compliance with drinking water standards, but the analysis of the ongoing processes responsible for the acquisition of water characteristics is challenging when the observations and parameters involved are numerous and the spatial variability is high. Thus, in the Provence-Alpes-Côte d'Azur region, in southeastern France, a previous study has shown that the regional statistical approach greatly dilutes and distorts the information, which hinders its processing. To avoid this, after assigning each sample to a groundwater body according to its location and the depth of the production well, we propose a 3-step approach. (1) A principal component analysis performed on the whole dataset allows for eliminating redundancies of information and thus reducing the dimension of the hyperspace of data without significant loss of information. (2) A hierarchical cluster analysis based on the average coordinates of groundwater bodies on the factorial axes leads to the grouping of MESO showing similar sources of variability, i.e., similar ongoing processes. The results obtained, which are consistent with the regional distribution of the major aquifers, lithology, and landscapes, validates this grouping. (3) A multivariate analysis performed within each group of MESO is then easier because it only considers more homogeneous hydrogeological units, within which the sources of variation are comparable. This approach allows better discrimination of the sources of variability, associated processes and pathways. Two results validate this analysis. First, a study by discriminant analysis and machine learning confirms that the relevant scale for the study of the information contained in the dataset is neither the sampling point nor the groundwater body, but the grouping of the groundwater body. Second, there is consistency between the sources of variability detected (water minerality, fecal contamination, lithology, redox process, and other anthropogenic factors) and our knowledge of the factors of well non-compliances derived from field studies. This approach makes it possible to envision an adapted, specific and relevant monitoring of resources intended for human consumption. It is a significant step forward for sanitary surveillance, a basis for implementing the European Union's Water Framework Directive (WFD). The methodology here shows its effectiveness on a regional scale, and we plan to test its application on the scale of large watersheds such as the Rhône River, a very different scale that will probably require an adapted approach.