Groundwater Quality Assessment: An Improved Approach to K-Means Clustering, Principal Component Analysis and Spatial Analysis: A Case Study

: K-means clustering and principal component analysis (PCA) are widely used in water quality analysis and management. Nevertheless, numerous studies have pointed out that K-means with the squared Euclidean distance is not suitable for high-dimensional datasets. We evaluate a methodology (K-means based on PCA) for water quality evaluation. It is based on the PCA method to reduce the dataset from high dimensional to low for the improvement of K-means clustering. For this, a large dataset of 28 hydrogeochemical variables and 582 wells in the coastal aquifer are classiﬁed with K-means clustering for high dimensional and K-means clustering based on PCA. The proposed method achieved increased quality cluster cohesion according to the average Silhouette index. It ranged from 0.13 for high dimensional k-means clustering to 5.94 for K-means based on PCA and the practical spatial geographic information systems (GIS) evaluation of clustering indicates more quality results for K-means clustering based on PCA. K-means based on PCA identiﬁed three hydrogeochemical classes and their sources. High salinity was attributed to seawater intrusion and the mineralization process, high levels of heavy metals related to domestic-industrial wastewater discharge and low heavy metals concentrations were associated with industrial wastewater punctual discharges. This approach allowed the demarcation of natural and anthropogenic variation sources in the aquifer and provided greater certainty and accuracy to the data classiﬁcation.


Introduction
Researchers use multivariate analysis to study the temporal and spatial characteristics of water quality [1][2][3][4][5][6][7][8].Because current water quality assessment standards are not uniform and there are multiple and complex sources of water contamination, K-means clustering and principal component analysis (PCA) are widely used in water quality analysis and management [3,5,9,10].Traditional clustering and PCA methodology for the evaluation of water quality is performed on the entire high dimensional hydrochemical dataset [3,9].Clustering high dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions.This approach allows identification of the significant characteristics or parameters that define the data structure by PCA and K-means allows the grouping of monitoring stations by similarities between the samples.Nevertheless, numerous studies have pointed out that K-means with the squared Euclidean distance is not suitable for high-dimensional data clustering because of dimensionality [11,12].One alternative to solving this problem is to use alternative distance functions [13].Another option to tackle this problem is to employ dimension reduction for high dimensional data.It is achieved apart from the traditional methods such as the Principal Component Analysis (PCA), with Multidimensional Scaling and Singular Value Decomposition [11,12].
The frequent problem in the application of cluster analysis is to decide an optimal number of clusters which suitably fit a data set.A significant advantage of the K-means algorithm is that it can evaluate the clustering results with the application of cluster validity indexes quantitatively and objectively, perceiving how many clusters are hidden in the dataset [11,14,15].Cluster validity or clustering evaluation is formally defined as giving objective evaluations to clustering results in a quantitative way [16].They are a necessary but challenging task in cluster analysis.It has even been stated that clustering validation be regarded as decisive as the clustering itself [11].Every clustering algorithm will discover clusters even in a dataset that has no natural cluster structure.Indeed, cluster validity has become the core task of cluster analysis, for which a significant number of validation measures have been proposed and, with precision, analyzed in the literature [15].Geographic information systems (GIS) can provide helpful tools for the manipulation and analysis of spatial information but is not complete for multivariate statistical and spatial studies [17].The use of GIS as a visual tool allows the researcher to explore statistical outputs that would otherwise be difficult to interpret [18].
For the reasons previously mentioned, we present a methodology for water quality evaluation-called the K-means based on PCA method-to reduce the dataset from high dimensional to low for the improvement of K-means clustering in addition to spatial analysis (GIS).No systematic study into its benefits over traditional clustering methods has been performed.It allows the identification of the dominant hydrogeochemical processes controlling groundwater chemical composition and to achieve more comprehension of natural and anthropogenic sources influencing the spatial distribution of groundwater quality in the coastal aquifer Santo Domingo, Mexico.

Study Case: Santo Domingo Aquifer
The Valley of Santo Domingo (SD) is located in the state of Baja California Peninsula in Mexico and it lies between 24.9 • 36.1 30.3" of latitude north and 111 • 23.8 26.8" of longitude west (Figure 1).The SD Valley climate is arid, the mean annual temperature shows a broad span, between 1.9 • C and 43.3 • C. The average annual precipitation reaches only 150 mm, with minimum and maximum values of 50 and 300 mm/year, respectively [19].The rainfall period occurs between July and September and the dry one from April to June.The annual evapotranspiration is 2270 mm/year, with maximum values of 370 mm/month [20].The SD basin is the most extensive agricultural region in the Baja California Peninsula in Mexico.The agricultural extension covers an area of 72,409 ha, 49.4% of them are irrigated.During the agricultural period from 2013 to 2014 water withdrawals were 158,579 m 3 [21].

Geology Setting
The geology of the Valley of Santo Domingo encloses two regional geological provinces: Purísima Sub-basin and The Giganta Volcanic Belt.The Purísima Sub-basin comprises Triassic sedimentary rocks and partially serpentized ultramafic rocks [22].There are three main lithological formations included in the Purísima Sub-basin: Paleocene lutites from Santo Domingo Formation and Paleocene Eocene sandstone and lutite sequences of Tepetate Formation.The geological structure of the Purísima Sub-basin consists of a syncline around 600 km long, occupied by Upper Cretaceous to Lower Tertiary sediments [23].
The Giganta Volcanic Belt, located in the eastern region, is 500 km long by 30-50 km wide and comprises volcanic material as well as pyroclasts, lava flows and breccias, in addition to continental sand from the Comondú Formation [24].The Salt Formation is widespread mainly in the central area.It is constituted by sand particles of quartz, feldspar and igneous rocks.The maximum thickness (185 m) is found in the northern region and it depends on the subsoil structure since the Quaternary sediments are assorted (fluvial, eolian and alluvial).Previous researchers reported high levels of carbonates in some aquifer zones, such as calcrete layers and a cementing material [24,25].

Hydrogeology Setting
The study area comprises part of three hydrographic basins: in the North by the Santo Domingo basin, in the middle part of the Las Bramonas watershed and to the South by the Santa Cruz basin [20,26].It is defined as an unconfined granular aquifer composed principally of the Salts Formation and Quaternary sediments [27].
The distribution of hydraulic conductivity in the SD Basin is unequal.The amount of silt and clay in the Salt Formation increases towards the west, so a low hydraulic conductivity can be deduced.
In the SD Basin groundwater system, water flows from recharge areas in the western edge of the Sierra La Giganta Mountains towards the west in the Pacific Ocean [25].
In 1957, the elevation of the static water level in wells was above sea level (a.s.l.) in the entire SD Valley, whereas in 1996 the water level was 20 m a.s.l. in the direction of the east, but, in the center of the SD basin, the water level in wells was below mean sea level.Two local drawdown cones between 20 and 25 m below sea level (b.s.l.) are demarcated in the central area [25].Previous studies in the 1980s reported high extractions reaching withdrawals of up to 450 Mm 3 /year (million m 3 per year) and around 2.4 times the annual average recharge [25].In the following years, the average annual recharge (188 Mm 3 ) nearly equaled the extraction rate (168 Mm 3 ) [26].

Water Sampling
Groundwater samples from 600 agricultural use wells (Figure 1) were collected at depths between approximately 16 m and 83 m between March and July 2010 in the aquifer area.Duplicate 250 mL polyethylene bottles were utilized; one contained HNO 3 added to 2 mL of 0.02 N for metals determination.The other sample was kept unacidified for anions and cations analysis [28,29].
Also, significant cations such as Ca 2+ , Mg 2+ , Na + and K + were determined by atomic absorption (AAS) and a flame photometer (Model: Systronics Flame Photometer 128, SYSTRONICS, Ahmedabad, India).A titration procedure determined the HCO where all cations and anions are specified as milliequivalents per liter.The ion balance errors for all groundwater samples ranged from −1.7 to 9%.

Research Approach
The approach used for this research involved five main components: Phase I Database pretreatment; Phase II K-means clustering on the original high dimension dataset; Phase III, to reduce dimensional space; Phase IV, K-means clustering on the low dimensional dataset; Phase V, cluster evaluation.The RStudio v. 1.0.153(Copyright RStudio Inc., Boston, MA, USA) was utilized to perform descriptive statistics, study the distribution of the measured variables, the correlations matrix, perform the principal components analysis (PCA) and the cluster analysis (CA) (Figure 2).

Phase I Database Pretreatment
Eighteen wells exhibited incomplete data or were found to be outliers and they were removed from the original dataset (600 registers).Data that included values frequently lower than the detection limit of the method were excluded.When no recognition of ions was recorded, they were completed by the mean values of the neighboring data [30].
Most multivariate statistical methods require a log-normal data distribution.Therefore, the Kolmogorov-Smirnov (K-S) statistics were applied to test the goodness-of-fit of the data to a log-normal distribution.A 95% confidence log-normal distribution was obtained with a high significance level for all the parameters (p < 0.05) according to the K-S test [31].
Spearman's rank was used to analyze the correlation between variables to account for the non-normal distribution of water quality parameters [32].Kaiser Meyer Olkin (KMO) and Bartlett's Sphericity statistics were performed to test the data accuracy and suitability for PCA, on the parameter correlation matrix.KMO is used to measure the sampling adequacy, which indicates the proportion of shared variance, that is, what might be caused by unknown factors.A high value (close to 1) commonly indicates that PCA may be useful as confirmative in this study (KMO 0.79).Bartlett's test of Sphericity is employed to check the null hypothesis corresponding to uncorrelated variables [31].It afforded a significance level less than 0.05, indicating excellent relationships among variables.

Phase II K-Means on the Original High Dimensional Dataset
The R package NbClust provided 24 indexes (Table S1) to determine the optimal number of clusters in a dataset [15].The K-means algorithm based on within-cluster variation is a measure to form homogeneous clusters [14].The clustering process starts with initial choosing observations and numbers the desired cluster to create initial centers-also called cluster centers-setting out from some initial values known seed-points.Each observation was placed randomly in a cluster to which it is closest, creating temporary clusters.K-means clustering was formulated as the sum of squared errors as is shown in the below equation [14,33,34]: where is known as the centroid of cluster C l , 1 ≤ l ≤ K; n l is the number of data objects in the cluster and K is the number of clusters.
The gravity centers of each temporary cluster are calculated and these become the new cluster centers.The data are partitioned randomly and iteratively reassigned to another cluster based on the nearest distance to the cluster's center [14,33].The procedure finishes when there is no reassignment to any data from one cluster to another [35].

Phase III to Reduce Dimensional Space
Principal Component Analysis (PCA) was employed to identify the most meaningful hydrochemical parameters.
The principal component (PC) approach is expressed as: where is the component score; a is the component loading; x is the measured value of the variable; i is the component number, j is the sample number and m indicate the total number of variables [9,36].Similar studies were successfully applied to assess water quality [3,36,37].

Phase IV K-Means Clustering on the Low Dimensional Dataset
We analyzed the efficiency of the K-means clustering in low-dimensional space.It was integrated by the reduced projected dataset ( Ŷ) into a new coordinate axis by applying Ŵ to X.
where Ŵ is the transformation matrix consisting of the most significant PC and X is the matrix of the measured values.Its computational complexity is dominated by the M-step.The distance between each data point to all K centroids is computed: O(nKp), where n, K and p are the number of data points, the number of clusters and the dimension of the data, respectively.Thus, our approach utilizes this advantage because we work in the PCA subspace ( Ŷ) with a small dimensionality.

Phase V Cluster Evaluation of Technical Quality
Various measures are used to quantify the "goodness" of a cluster solution.In a good cluster solution, the elements within a cluster are similar to one (cohesive) while the clusters themselves are entirely different (separated) [38].A popular measure is the silhouette coefficient, which is a measure of both cohesion and separation [39].The silhouette measure varies from −1 to +1.In a good solution, the within-cluster distances are small and the between-cluster distances are extensive, resulting in a silhouette measure close to the maximum value of 1. Clusters obtained were contrasted with the spatial visualization of wells by ArcGIS v.10.3 (Copyright ESRI Inc., Redlands, CA, USA), to analyze the relationship between their characteristics and the activities on the surface.
The hydrochemical data were subjected to graphical treatment by plotting them in Piper´s Trilinear and using AquaChem v. 5.1 software (Copyright Waterloo Hydrogeologic, Waterloo, ON, Canada) and Stiff plots mapping was carried out using AquaChem v. 5.1 and ArcGIS v. 10.3.These methods were useful for understanding and identifying the water composition in different types.

Hydrochemical Analysis
Analysis of the 28 hydrochemical variables of groundwater in the study are presented in the descriptive statistical summary in Table 1.The water samples were slightly acidic to alkaline with pH values between 6.40 and 8.78 and electrical conductivity (EC) values ranging from 0.01 to 8.76 ∂S/m.Total dissolved solids (TDS) showed wide ranges of values from 326.4 to 5606.4 mg/L, with an average of 1285.10 and a standard deviation of 765.79 (Table 1), indicating wide variations in the concentrations of TDS attributed mainly to the seawater intrusion also reported in previous research [25].Chloride concentrations with values ranging from 42.54 to 456.32 mg/L showed wide dissimilarities which are observed in the standard deviation of 355.44 (Table 1).Therefore Na + , Ca 2+ , HCO  2 shows the main parameters obtained by the Spearman's rank correlation matrix, from which can be observed a highly significant positive correlation of 0.90, 0.95, 0.92 and 0.82 between Ca 2+ and Mg 2+ , Cl − and TDS, Cl − and Mg 2+ , in addition to Ca 2+ and TDS respectively (Figure 3a-d).These demonstrate the significant contribution of these elements to the mineralization processes and salinization.Reference [40] studied variations in coastal aquifer groundwater and reported similar correlations (>0.9) between Cl − , Na + , Mg 2+ and TDS.
A very high correlation (0.90) between Ca 2+ and Mg 2+ suggests the dissolution of calcite (CaCO 3 ) and dolomite CaMg(CO 3 ) 2 (Figure 3a) [41,42], the main component of sedimentary rocks, existing in these zones.High correlations of 0.92 and 0.83 between Cl − with Mg 2+ and Ca 2+ (Figure 3e) and low correlations between HCO 3 − with Ca 2+ and Mg 2+ (Figure 3f) confirm that the salinization process is crucial in the aquifer [43].A good correlation (0.85) between Sr 2+ and Cl − shows the aquifer mineralization process.Also, Br − concentrations are well correlated with Cl − and Mg 2+ (0.75 and 0.75, respectively).These facts corroborate that their high levels are attributed to seawater intrusion in the aquifer.Low concentrations of SO 4 2− with an average of 29.23 mg/L (Table 1) were observed throughout the aquifer.Despite NO 3 − levels being mostly within the maximum permissible limits [44,45], the highest values observed were between 20 and 30 mg/L.The NO 3 − boost could originated from agricultural and livestock activities when shallow static levels are present, allowing a direct infiltration of pollutants.
Therefore, high concentrations of Co 2+ , Cr 3+ , Cu 2+ , Fe 3+ , Mn 2+ , Ni 2+ and Pb 2+ (Table 1) were located near to urban and agricultural areas, these concentrations exceed the permissible values for drinking water [44] and irrigation [45] (Table S2 and Figure S1).Huge Cu 2+ concentrations between 8.83 and 3.8 mg/L were observed nearby to settlements.High concentrations of Fe 3+ between 3.2 and 4.2 mg/L also occur near urban areas.Moreover, critical levels of Pb 2+ and Mn 2+ (0.49 and 0.33 mg/L, respectively) are observed in the North and East.A high positive correlation of 0.92 between Cr 3+ and Zn (Table 2) could indicate that industrial activities are its principal source.

Water Types
The hydrochemical data analysis of the samples was plotted using Piper' Trilinear diagram (Figure S2).Also, the samples were plotted with a Stiff diagram and projected with ArcGIS v. 10.3 (Figure 4).Piper's diagram is a suitable method to reveal the relations, dissimilarities and to classify water types based on the ionic composition of different groundwater samples [46].The main water types were identified and categorized on the basis of significant ion concentrations, arranged in decreasing order of abundance, where the percentage reveals the amount of groundwater samples that fall into a water type.The Piper's diagram shows that the main water types identified in the SD basin are mixed water types where these types cannot be recognized, due to neither anions nor cations being dominant, considering that no one cation−anion pair exceeds 50%.

Mixed of Ca
The plot showed that 38% of the groundwater samples belong to mixed water of Ca 2+ -Mg 2+ -Na + -Cl − -HCO 3 − .This water type covers the central and northeastern areas, in intermediate-depth wells between 30 to 60 m.It observed that this combined water is located nearby to SD and Bramonas rivers, irrigation canals and natural flow streams (Figure 4).The spatial location of wells suggests that this water class is related to urban, industrial and agricultural wastewater discharges in water bodies.

Mixed of Ca
The Piper´s diagram revealed that 30.8% of groundwater samples pertain to a combined water of Ca 2+ -Mg 2+ -Na + -Cl − .This water class includes the northern and western regions, where deep wells (60 to 80 m) are located near to settlements and shallow wells (16.91 to 30 m) occur nearby coastal zones (Figure 4).The spatial location of wells indicates that this water type is associated with wastewater discharge from anthropogenic sources and in addition to natural processes (intrusion seawater and mineral dissolution) in deep wells and shallow wells respectively.

Mixed of Na
The plot showed that only a few groundwater samples (13.7%) belong to mixed water of Na + -Mg 2+ -Cl − -HCO 3 − .This water type covers most of the northern region, intermediate-depth wells between 30 and 60 m, generally located near agricultural zones and far from urban settlements (Figure 4).The spatial location of wells suggests that this water type is related to anthropogenic wastewater discharges from agricultural, industrial and urban activities.The results are similar to those reported in Reference [25].

K-Means Clustering on High Dimensional Dataset
Partition method K-means clustering was applied to a large dataset of 28 hydrogeochemical variables and 582 wells.Ten indexes from the R package NbClust suggested that wells are grouped in three clusters (See Table S3).Hence, further non-spatial and spatial analyses were performed based on this criterion.
The partition method K-means clustering, employed on a big dataset, resulted in three groups of wells.The spatial analysis showed high scattering among wells that belong to the same group (Figure 5).Wells within each group disclosed dissimilar hydrogeochemical characteristics.The cluster validity was carried out with the Silhouette index, which was from 0.13 as shown in Figure S3.Therefore, K-means clustering used on a large dataset revealed a grouping that was not suitable.

Principal Component Analyst (PCA)
Principal component analysis was used to decrease the dimensional space of the large dataset in order to improve the clustering.PCA considerably reduced 28 hydrogeochemical variables to 16 variables.PCA revealed that four components explain 71.6% of the total variance, with the salinization process and anthropogenic activities being the main factors controlling the groundwater quality variability.The PCA results are shown in Table 3.The PCA approach identified four components that have the most critical loading (Figure 6).Principal component analysis (PCA) plot of variable space deduced from the geochemical analysis.

PCA2
The second component (PCA2) explains 16.55% of the total variance and was assembled by Cr 3+ (0.91), Cu 2+ (0.86), Pb 2+ (0.79) and Zn (0.79), showing high correlations among themselves toward the same direction (Table 3).The main variables (Cr 3+ , Cu 2+ , Pb 2+ and Zn 2+ ) within PCA2 demonstrated an increase caused by metal pollution in the Valley (Figure 6).High concentrations occur near urban areas and small settlements.In this component, the primary cause of pollution is industrial activity.Furthermore, this process is reinforced by the high levels of Cl − in the aquifer through Chloride complexation.This process makes metal mobility in the aquifer easy [47].

PCA3
The third component (PCA3) accounts for 12.07% of the total variance and describes the significant contributions of HCO 3 − (−0.84),K + (0.76) and the static water level (0.82) (Table 3 and Figure 6), disclosing good correlations among themselves.PCA3 revealed an inverse correlation between HCO 3

−
and the static water level, demonstrating that when increasing the static water level, concentrations of HCO 3 − decrease.High concentrations of HCO 3 − are found in shallow wells.Commonly, K + concentrations in aquifers are low (<10 mg/L); otherwise, external sources overcome the permitted limit [48,49].The high K + concentrations (10 to 24.24 mg/L) occur in urban areas and near agricultural zones.This result indicates that domestic wastewater discharges and potassium fertilizers are essential sources (Figure 7).

PCA4
The fourth component (PCA4) explains 9.02% of the variance and shows the highest correlation between Mn 2+ (0.89) Fe 3+ and (0.76) (Table 3 and Figure 6), conducted toward the same direction.Both elements are naturally simultaneous and PCA4 stated the same source for Fe 3+ and Mn 2+ ions.Higher concentrations are observed near the urban areas; they could be released to the environment by any industrial and domestic wastewater punctual discharge, as there is no mining activity in the zone.

K-Means Clustering on Low Dimensional Dataset
The low dimensional dataset was obtained by Equation ( 4), where Ŵ is the matrix of the main four PCs found and X is the matrix of values observed and ( Ŷ) is the matrix of transformed data (582 X 4).We realized K-means clustering on a low dimensional dataset and showed significant improvement in the grouping.K-means clustering on a low dimensional dataset was achieved to increase quality cluster cohesion according to an average Silhouette index.The index ranged from 0.13 (Figure S3) for high dimensional K-means clustering to 5.94 (Figure S4) for K-means based on PCA.Both groupings were projected spatially for practical evaluation.K-means based on PCA clustered samples with similar hydrogeochemical characteristics, showing higher quality results (Figure 8 and Table 4).4).The groundwater class is mostly affected by salinity, caused by different factors such as the mineralization process, seawater intrusion, industrial and urban wastewater discharges, infiltration of leachates from open dumps and poorly designed sanitary landfills, fertilizer application and water over pumping in deep wells.Deep wells are located near to urban and agricultural areas while shallow wells are proximate to coastal zones (see red zones in Figure 8).
For example, high concentrations of K + are observed mainly in class 1 (Figure 7a) and could be attributed to domestic and agricultural wastewater discharges.Groundwater samples from class 1 and class 3 collected from wells belong to agricultural and livestock zones and urban zones (Figure 8).Smaller concentrations of K + showed in class 2 (Figure 7a) due to the fact that pollutants do not reach distant zones.Water samples from class 2 were collected from regions further away from settlements and agricultural activities.
Figure 7b shows high concentrations of Al 3+ , Pb 2+ , Cu 2+ and Cr 3+ in class 2. Industrial and domestic wastewater are poured into drainage irrigation canals and natural flow streams and cause damage [50].Wells have intermediate depths, located near channels and flow streams such as Bramonas and the Santo Domingo rivers (see yellow zones in Figure 8).

Third Class: 256 Wells
Class 3 has better water quality and therefore it is less influenced by industrial and domestic wastewater discharges.The slight increases of 0.01, 0.01, 0.01 and 0.02 mg/L (Table 4) in ion concentrations such as Co 2+ , Mn 2+ , Ni 2+ and Pb 2+ respectively could be attributed to industrial wastewater punctual discharges and agriculture and natural geological conditions.Deeper wells are mainly dedicated to agricultural irrigations and are located in this zone (see orange zones in Figure 8).

Discussion
K-means clustering based on PCA allows the finding of the location of 160 wells (class 1).Shallow wells are located mainly in coastal zones and deep wells are located in urban and agricultural areas of the aquifer.They are related to variables highly significant for PCA1 (TDS, EC, Cl − , Na + , Mg 2+ , Ca 2+ and SO 4 2− ) and PCA3 (HCO 3 − , K + and static level).High concentrations of these species found in wells belong to class 1 and class 3 respectively.The water quality of the wells is related to the salinization process and seawater intrusion, the primary source controlling the groundwater quality variability in the aquifer.The K-means algorithm grouped 166 wells (class 2) near to drainage irrigation canals and flow streams.They were linked to the significant variables of PCA2 (Cu 2+ , Zn 2+ and Cr 3+ ) and PCA4 (Fe 3+ ).High levels of these ions observed in wells belong to class 2 and class 1 respectively.In these wells, the water quality influenced by heavy metals sourced from industrial and domestic wastewater poured into surface streams.
The third class includes 250 wells (class 3) located mainly in agricultural zones in the center and north aquifers.They correlated with the variables of PCA2 (Pb 2+ and Zn 2+ ) and PCA4 (Mn 2+ ).High concentrations of these species found in wells belong to class 3.In this group of wells, the primary source influencing the groundwater quality is the location of industrial wastewater punctual discharges.Class 3 has better water quality than classes 1 and 2.
In addition to the previous results, the improved method (K-means clustering based on PCA) allowed definition of the characteristics and hydrogeochemical processes of the aquifer.The determination of hydrogeochemistry is essential for establishing a conceptual model and distinguishing the natural processes of an aquifer.However, the factors that affect hydrochemistry are mainly controlled by stochastic processes, so they vary in time and space.For these reasons, the sources of variation must identify for the correct determination of the hydrogeochemical model.In this case study, there is excellent spatial coverage, 600 wells distributed throughout the aquifer.However, little attention is paid to temporal variation, so the analysis does not allow us to analyze the temporal factors that cause variation in water quality, such as a change due to randomness (storm, rainfall) and changes due to climatic seasons (temperature, rainfall).Therefore, these monitoring systems show a static but excellent spatial representation of the hydrochemical structure of the aquifer and allow us to analyze the principal sources of variability by cross-referencing with the spatial distribution of water quality.

Conclusions
The proposed method achieved improvement of the cluster cohesion Silhouette index ranging from 0.13 for high dimensional k-means clustering to 5.94 for K-means clustering based on PCA and practical spatial GIS evaluation of clustering indicated high-quality results for K-means based on PCA.
K-means clustering based on PCA identified three hydrogeochemical classes and their sources.High salinity was attributed to seawater intrusion and mineralization process, high levels of heavy metals related to domestic-industrial wastewater discharge and low heavy metals concentrations were associated with industrial wastewater punctual discharges.This approach allowed the demarcation of natural and anthropogenic variation sources in the aquifer and provided greater certainty and accuracy of the data classification.
Three hydrochemical classes of groundwater were identified.The first class (Cluster 1): Shallow wells and deeper wells consistent with proximity to the coast and urban zones respectively.This class of water was associated with high salinity, which comes from seawater intrusion and the mineralization process.The second class (Cluster 2): Intermediate and deep wells located in urban-industrial zones and settlements.This type of water correlated with high concentrations of heavy metals, which it could attribute to domestic-industrial wastewater discharge in drains and flow streams.The third class (Cluster 3): Deeper wells located mainly in agricultural and urban areas.This water-type related to a slight increase in some heavy metals, which was attributed to industrial wastewater punctual discharges.
The proposed method showed a static but excellent spatial representation of the hydrochemical structure of the aquifer and allowed us to analyze the principal sources of variability by cross-referencing with the spatial distribution of water quality.The improved method could be applied to optimize sampling and to schemes monitoring groundwater quality.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2073-4441/10/4/437/s1, Figure S1: Comparison between the concentrations of ions and Maximum Permissible Limit (MPL) recommended by national standards for drinking water and irrigation water, Figure S2: Groundwater types plotted on a Piper trilinear diagram, Table S1: Indexes provided by R NbClust Package, Table S2: Comparison of the groundwater quality in relation to drinking water and irrigations standards, Figure S3: Silhouette plot of the clustering with high-dimensional dataset, Figure S4: Silhouette plot of the clustering with low-dimensional dataset, Table S3: Number of clusters suggested by each index.

Figure 1 .
Figure 1.Location map of the study area showing the distribution and identification of sampled wells, main rivers, settlements and urban zones.

Figure 2 .
Figure 2. K-means clustering processes followed in the study.

Figure 4 .
Figure 4. Spatial distribution map of the water types demonstrated by Stiff diagram.

Figure 5 .
Figure 5. Spatial distribution of groups using K-means clustering on high dimensional dataset.

Figure 6 .
Figure 6.Principal component analysis (PCA) plot of variable space deduced from the geochemical analysis.

Figure 7 .
Figure 7. Clustering wells (class 1, class 2 and class 3), related to the hydrogeochemical dataset resulted.(a) Parameters mainly linked to salinity (b) Parameters linked to heavy metals pollution.

Table 1 .
Statistical summary of hydrochemical parameters of groundwater.

Table 2 .
Spearman's rank correlation matrix for the groundwater quality data.
Note: Coefficients greater than 0.5 are underlined.

Table 3 .
Principal component and varimax rotated component matrix.
Note: A Rotation method: varimax with Kaiser normalization.

Table 4 .
Statistics of the three categories found from the K-means clustering analysis.Dominant parameters are EC, TDS, Cl − , Ca 2+ , K + and Na + with average values of 3.39, 2169.94,872.41, 217.96, 9.18 and 287.90 respectively (See Table