Investigating Hydrochemical Groundwater Processes in an Inland Agricultural Area with Limited Data : A Clustering Approach

Groundwater chemistry data are normally scarce in remote inland areas. Effective statistical approaches are highly desired to extract important information about hydrochemical processes from the limited data. This study applied a clustering approach based on the Gaussian Mixture Model (GMM) to a hydrochemical dataset of groundwater collected in the middle Heihe River Basin (HRB) of northwestern China. Independent hydrological data were introduced to examine whether the clustering results led to an appropriate interpretation on the hydrochemical processes. The main findings include the following. First, in the middle HRB, although groundwater chemistry reflects primarily a natural salinization process, there are evidence for significant anthropogenic influence such as irrigation and fertilization. Second, the regional hydrological cycle, particularly surface water-groundwater interaction, has a profound and spatially variable impact on groundwater chemistry. Third, the interaction between the regional agricultural development and the groundwater quality is complicated. Overall, this study demonstrates that the GMM clustering can effectively analyze hydrochemical datasets and that these clustering results can provide insights into hydrochemical processes, even with a limited number of observations. The clustering approach introduced in this study represents a cost-effective way to investigate groundwater chemistry in remote inland areas where groundwater monitoring is difficult and costly.


Introduction
Multivariate statistics have been widely used to analyze complex and high-dimensional datasets in hydrological research [1][2][3][4].Clustering, a robust classification scheme for partitioning a dataset into homogeneous groups [5], is a typical multivariate statistics technique that has been used for numerous hydrological applications, such as rainfall intensity estimation [6], drought frequency analysis [7], stream turbidity predictions [8] and watershed regionalization [9].Hydrochemical datasets of groundwater samples usually include multiple attributes (e.g., concentrations of various ions, isotopes or other chemicals) and therefore contain valuable information about a variety of regional hydrochemical processes, including rock-water interactions [10], surface water-groundwater interactions [11], evaporation [12], saltwater intrusion [13] and agricultural fertilization [14].
Each groundwater sample is an observation of the system and can be treated as a realization of the system's random attributes.
Different clustering methods have been used to analyze multivariate hydrochemical groundwater data [1,5,[14][15][16][17].These clustering methods fall into two main categories: heuristic data mining methods and probability model-based methods [6].Clustering methods can also be classified as "crisp" or "hard" methods (i.e., an observation belongs exclusively to a single cluster) and "fuzzy" or "soft" methods (i.e., an observation belongs to all clusters with different degrees of membership) [16].In the field of hydrology, heuristic methods are commonly used.Some heuristic methods, such as hierarchical clustering methods [1,17] and k-means clustering [18], are "crisp", whereas others, including the fuzzy c-mean method [15,16], are categorized as "fuzzy".In general, the fuzzy c-mean method is more reliable for dealing with hydrochemical data than the "crisp" methods because the physical and chemical properties of a hydrological system usually vary continuously (in both space and time) rather than abruptly [16].Because fuzzy clustering provides degrees of membership, rather than clear-cut distinctions, it can better reflect the spatial continuity of a hydrological system.
Unfortunately, all heuristic methods have several major weaknesses.First, because "similarity" is defined in heuristic methods in terms of measured distances, such as Euclidean distance [19], observations belonging to similar correlations may be incorrectly grouped if they record long distances.Second, the process of selecting an appropriate clustering method and determining its parameters is usually subjective.There are no quantitative criteria to determine the distance metrics and key clustering parameters, such as the number of clusters and the initial cluster centers.In contrast to heuristic methods, model-based methods do not use distance measures.Instead, these models assume that a given dataset contains several sub-populations that should be modeled separately and that the overall population represents a mixture of these subpopulations [20].Observations belonging to the same sub-population should enter the same cluster.Each sub-population can be represented by a parametric distribution (e.g., a Gaussian distribution), and the entire dataset can be modeled as a mixture of multiple distributions [21].Because model-based methods are not restricted to distance measures and adopt a rigorous probability framework [22], they provide more reliable and meaningful clustering results in certain situations.Figure 1 shows such an example.Additionally, model-based methods can use quantitative criteria, such as the Bayesian Information Criterion (BIC), as objective metrics to determine the best models and number of clusters [23,24].It has been demonstrated in other fields that model-based methods can overcome the limitations of heuristic methods [20].However, model-based methods have rarely been applied to hydrological studies [14,25].This study used GMM clustering on a hydrochemical dataset of groundwater samples collected from a semi-arid agricultural area in northwestern China.Its main objectives were to (1) investigate whether and how GMM clustering can be effectively used to address hydrochemical datasets in areas where both natural and anthropogenic factors exert significant effects on groundwater chemistry and (2) demonstrate how GMM clustering can produce an integrated understanding of basin-scale Selecting an appropriate probability model is essential for model-based clustering.The Gaussian Mixture Model (GMM) has been widely applied to studies involving pattern recognition and machine learning, information processing, data mining, and clustering [26].This model has two major advantages: first, it can adequately approximate a broad class of distribution functions, and second, the mathematical form of the GMM simplifies the derivation of the subsequent parameter estimation method [27].
However, the GMM has rarely been adopted in model-based clustering to explore hydrochemical groundwater data.To the authors' best knowledge, [14] represents the first attempt, but this pioneering application left some important issues to be further investigated.In the case study of [14], the groundwater chemistry was dominated by anthropogenic impacts, and only two ions were considered to be attributes in the final model.Therefore, it remains unclear whether GMM clustering can be used to understand more complicated hydrochemical processes (i.e., processes with both natural and anthropogenic impacts).Additionally, the clusters of groundwater samples studied by [14] did not record clear spatial patterns.Therefore, further work is needed to determine whether GMM clustering can produce an integrated understanding of groundwater chemistry at a large scale.
This study used GMM clustering on a hydrochemical dataset of groundwater samples collected from a semi-arid agricultural area in northwestern China.Its main objectives were to (1) investigate whether and how GMM clustering can be effectively used to address hydrochemical datasets in areas where both natural and anthropogenic factors exert significant effects on groundwater chemistry and (2) demonstrate how GMM clustering can produce an integrated understanding of basin-scale hydrochemical processes with a limited number of observations.

Study Area
The Heihe River is the second largest inland river in China, with a total length of approximately 900 km.It originates within the Qilian Mountains, flows north and terminates at East Juyan Lake.The entire Heihe River Basin (HRB) (Figure 2a) consists of three distinctive parts: the mountainous upstream area, which is mainly located in Qinghai Province, the semi-arid midstream area with intensive oasis agriculture, which is mainly located in Gansu Province, and the downstream area, which mainly comprises the vast Gobi Desert in Inner Mongolia.On the main river, Yingluoxia and Zhengyixia (Figure 2b) are the separating points of upstream-midstream and midstream-downstream, respectively.With a total area of approximately 130,000 km 2 , the HRB has an arid continental climate; its annual precipitation ranges from 50 to 300 mm.The Heihe River has more than thirteen tributaries, but some have lost their surface water connections with the main river.The interactions between surface water and groundwater are substantial and complex in the HRB [28].The HRB has been affected by significant human-nature water conflicts (i.e., midstream vs. downstream, and agriculture vs. ecological services), as discussed in previous studies [28][29][30].
This study focuses on the midstream area of HRB (Figure 2), which is a major part of the Hexi Corridor, a region that was once crossed by the famous ancient "Silk Road".Its annual precipitation is only 100-150 mm, but it receives a significant amount of surface runoff from the Qilian Mountains, where the annual precipitation exceeds 350 mm [31].The middle HRB has a long history of irrigated agriculture in its oases.Based on the water resource reports of the local government, farmlands have rapidly expanded over the past several decades, with agriculture now consuming more than 90% of this region's water supply.Thus, industrial and domestic water uses have very limited impacts on the hydrological and hydrochemical processes in the middle HRB.To secure the environmental flow towards the lower HRB and restore the diminishing terminal lake, the central government enforced a tight restriction on surface water diversion for irrigation in 2000.However, as an immediate response to the flow regulation, groundwater pumping in the middle HRB has increased rapidly since 2000, causing groundwater drawdown and wetlands degradation in some areas.In the study area, from south to north, the sediments gradually change from coarse-grained gravel to medium and fine-grained sand and then to silt [32,33].A fault along the foot of the Qilian Mountains largely prevents groundwater from flowing from the mountains into the basin laterally.Thus, streams from the mountain area are the main recharge source for the aquifers.In the south part of Minle County, the aquifer is formed from highly permeable cobble and gravel deposits with a thickness more than 200 m.The depths to the water table range from 50 to 200 m in Ganzhou district and 0.5-5 m in the north part of the floodplain (Linze and Gaotai Counties).Groundwater discharges to river or as springs at the middle part of the study area.In our previous study, we performed In the study area, from south to north, the sediments gradually change from coarse-grained gravel to medium and fine-grained sand and then to silt [32,33].A fault along the foot of the Qilian Mountains largely prevents groundwater from flowing from the mountains into the basin laterally.Thus, streams from the mountain area are the main recharge source for the aquifers.In the south part of Minle County, the aquifer is formed from highly permeable cobble and gravel deposits with a thickness more than 200 m.The depths to the water table range from 50 to 200 m in Ganzhou district and 0.5-5 m in the north part of the floodplain (Linze and Gaotai Counties).Groundwater discharges to river or as springs at the middle part of the study area.In our previous study, we performed integrated surface water-groundwater modeling using GSFLOW [34] to study the complex water cycle of the HRB and its response to human activities [29,31,35].It has been found that this area has significant surface water-groundwater interaction, and the interaction is complicated by intensive pumping and irrigation [30,31,35].Figure 2c plots the groundwater table and flow directions based on Tian et al.'s model [28].
The surface water salinity increases from upstream to downstream, and the water type shifted gradually from HCO 3 − , SO 4 2− to Cl − , mostly due to water-rock interactions [10,[36][37][38], which is typical of arid to semi-arid areas.Nonetheless, little attention has been paid to the groundwater chemistry of this region.Some studies [10,32,33,39] provided basic descriptions of the hydrochemical features of the entire basin, but none have performed in-depth investigations of the agricultural area in the middle HRB.Additionally, the previous studies mostly discussed natural chemical evolution processes while overlooking anthropogenic impacts.How the agricultural activities, such as diversion, pumping, irrigation and fertilization, would impact the groundwater chemistry deserves further investigation.

Sample Collection and Treatment
This study focuses on groundwater used for agricultural irrigation.In a field campaign in August 2014, 73 groundwater samples were collected from irrigation wells (see Figure 2b) in Zhangye, Jiuquan and Jinta Counties, which contain the majority of the population and agriculture in the middle HRB.These wells were created for irrigation and mainly draw water from the uppermost aquifer, which can provide sufficient groundwater for irrigation.Most of these irrigation wells pump groundwater from unconfined shallow aquifers (in which the groundwater depth varies from 0.7 to 170 m) that interact with surface water.Therefore, these samples record important information about anthropogenic impacts on the groundwater system.A few irrigation wells in Minle County (Figure 2) draw groundwater at depths of 200-300 m.
At each sampling site, groundwater was pumped out for more than 15 min before any samples were taken to clean the well tube.In the field, HCO 3 − and CO 3 2− contents were determined using titration; and electrical conductivity, temperature and pH were measured using portable equipment.When collecting samples for laboratory analysis, water was first filtered using a 45-µm cellulose membrane and stored in an acid-cleaned 30-mL high-density polyethylene (HDPE) bottle.Samples were stored in a cooler and then transferred to a refrigerator on the same day.All samples were measured for Na + , Mg 2+ , Ca 2+ , K + , Cl − , and NO 3 − in the laboratory.These measurements were conducted at the Geochemistry Laboratory of the Cold and Arid Region Environmental and Engineering Institute, Chinese Academy of Sciences (Lanzhou, Gansu Province, China), following the standard methods developed by the American Public Health Association [40].Dissolved cations were measured using chromatography (DX-600 of Dionex) by an inductively coupled plasma optical emission spectrometer, and anions were measured using the ICS-2500 chromatograph of Dionex.All analyses followed the same standard methods and were carefully calibrated by an appropriately diluted standard.The accuracy and precision of these measurements were checked by analyses of reference materials.All relative errors are within ±5%.

GMM Clustering
The GMM considers an entire dataset to be a mixture of K clusters of Gaussian distributions, in which each cluster is associated with a weight ω k .This dataset can be in the form of an observation matrix, X (n × p), where n is the sample size (i.e., the number of observations) and p is the number of attributes.The ith observation is denoted as x i , which represents an independent realization from one of the K clusters.However, it is unknown to which cluster x i belongs [23].If µ k (p × 1) and D k (p × p) denote the expectation vector and the covariance matrix of the kth cluster (k = 1, 2, . . ., K), respectively, then we can further define θ k = {µ k , D k }.In this case, all observations are independent identically distributed (IID) vectors, and a mixture probability density function (pdf) (i.e., the probability model), with specific K and p values, can be written as where g k represents the kth Gaussian distribution and can be further written as [23]: The entire parameter set, , can be estimated using a maximum likelihood estimation (MLE) approach.We first define a label vector z ik .If the ith observation comes from the kth cluster, then z ik = 1; otherwise, z ik = 0.Because the observations are IID, the likelihood function can be defined as In Equation ( 3), direct MLE is not feasible because the cluster label z ik is unobserved.In such cases, the Expectation-Maximization (EM) algorithm [41] can be applied to obtain the optimal Θ, as introduced in Section 3.4.
After the MLE is completed, the membership probability of each observation belonging to the kth cluster, denoted as m ik , can be calculated as [21]: It holds that K ∑ k=1 m ik = 1.The ith observation can then be assigned to the cluster with the highest membership probability.The label z ik can then be determined as Additionally, for each observation, the clustering uncertainty can be calculated as [6]: In order to demonstrate the advantage of GMM clustering, k-means clustering, an approach widely used in various fields including hydrology [18], was also performed and compared with GMM clustering.The technical details of K-means clustering can be found elsewhere [42,43].

Model Selection
Selecting the appropriate K and p attributes is critical to the performance of the GMM clustering.Once the values of the K and p attributes are determined, the structure of the probability model (i.e., Equation ( 1)) is fixed, and MLE can be performed to further estimate the parameter set In this study, the Bayesian Information Criterion (BIC) was used to determine the K value and p attributes and is formulated as [24]: where L is the log-likelihood estimated by MLE (Equation ( 3)), λ is the number of parameters to be estimated, and n is the sample size (n = 73).Each cluster model with p selected attributes and a specific value of K achieves a BIC score; the model with the lowest BIC score can be interpreted as the best model [44].
In the GMM, λ can be calculated as [23]: The first item in this equation represents the total number of elements in the K expectation vectors (µ k ), and the second item represents the total number of covariance coefficients contained in all the D k 's.Obviously, λ increases rapidly with both K and p, which can lead to an increase of the BIC score, unless there is a significant improvement in L.
As a model-based clustering method, GMM clustering also suffers from the "curse of dimensionality" because it is likely to cause over-parametrization in high-dimensional spaces [23,45].Data preprocessing for dimension reduction is one solution to overcome this obstacle [46], and a classic technique is principal component analysis (PCA) [47].In some applications, primal variables are more suitable for clustering than principle components [14].In this study, to find a suitable set of p attributes for clustering, both the primal concentration variables and their principle components (PCs) were considered candidate attributes.Hydrochemical data often follow non-normal distributions, and therefore appropriate data processing is necessary before any parametric analyses [48].In this study, PCA was conducted for the standardized logarithms of the concentrations, as is commonly done in hydrochemical studies [16].

Expectation-Maximization Algorithm
The well-known EM algorithm can solve MLE problems using unobserved variables, and it has previously been applied to GMM clustering [48,49].In this case, the parameter set to be estimated via MLE, where θ k = {µ k , D k } represents the parameters of the kth Gaussian distribution.Referring to Equation (5), z ik is an unobserved variable depending on the dataset X n×p and the parameter set Θ.If t denotes the iteration number and the initial t = 0, then the key steps of the EM algorithm can be summarized as follows.
Step 4: Calculate the likelihood L using Equation (3).Set t = t + 1, and repeat Step 2 and Step 3 until the convergence condition is satisfied.A general convergence requirement is to reach |L (t+1) − L (t) | < ε. ε is a threshold that was set to 0.01 in our study.
More details of the EM algorithm can be found elsewhere [41,49].This algorithm is sensitive to the initial parameter values (i.e., Θ(0) ) and usually requires a large number of initial points to achieve adequate MLE results.In this study, for each candidate GMM, the EM calculation was repeated 10,000 times with randomly generated initial points.The estimated Θ value yielding the lowest BIC score was then considered the optimal parameter value for the model.All the above analyses were conducted using MATLAB.The MATLAB code of GMM-EM is provided by https://github.com/HammerZhang/GMM.The kmeans function in MATLAB was used to perform k-means clustering.

Descriptive Statistics
For our samples, the concentration of total dissolved solids (TDS) ranges from 307.6 to 4164.6 mg/L, with an average value of 1245.6 mg/L.According to [50], 15 samples can be categorized as non-saline drinking and irrigation water (TDS ≤ 500 mg/L), 40 samples as slightly saline irrigation water (500 mg/L < TDS ≤ 1500 mg/L), and 18 samples as moderately saline, primary drainage water (1500 mg/L < TDS ≤ 7000 mg/L).The concentrations of TDS and individual ions were first log-transformed, and then standardized to have zero mean and unit variance.Kolmogorov-Smirnov tests performed with MATLAB validated that the transformed concentration data follow a standard normal distribution.For Na + , K + , Mg 2+ , Ca 2+ , Cl − , SO 4 2− , NO 3 − and HCO 3 − , the p-values of Kolmogorov-Smirnov test are 0.0994, 0.433, 0.501, 0.581, 0.054, 0.399, 0.068 and 0.223, respectively.Thus, the hypothesis that the transformed concentrations follow a standard normal distribution cannot be denied at the confidence level of α = 0.95, and parametric statistics are applicable to the transformed data.Table 1 presents the Pearson correlation coefficient matrix of the log-transformed concentrations.NO 3 − does not show any significant correlation with other variables, with all p-values higher than 0.05.All other pairs record significant correlations, with p-values less than 0.01.In particular, the pairs of Na + -Cl − , Mg 2+ -Ca 2+ , Na 2+ -SO 4 2− , Cl 2+ -SO 4 2− and Mg 2+ -SO 4 2− have correlation coefficients greater than 0.85.In fact, we also calculated Spearman's rank correlation coefficients [51,52].The nonparametric results are very similar to the parametric ones, except that weak correlation between NO 3 − and several cations is identified.In a typical arid and semi-arid river basin, most ions would have a tendency to increase from upstream to downstream because of the rock weathering and lixiviation processes.Our data also reflect this tendency.The correlation between TDS and the ions are all positive, and the correlation coefficients are very high, mostly above 0.8, except those for K + , NO 3 − and HCO 3 − .K + has similar properties as Na + , but its abundance in natural waters is much lower [52], and it is intensively absorbed by plants [53].NO 3 − in groundwater is not significantly influenced by rock-water interactions.Instead, it reflects the influences of anthropogenic activities, such as fertilization in irrigated farmlands, where excessive nitrate may leach into groundwater [14,54].Among the 73 samples, 8 samples have NO 3 − concentrations greater than 44.3 mg/L, which exceed the drinking water standards defined by the WHO.For HCO 3 − , the dominating impact of rock-water interactions may be reduced by the infiltration, which can transport surface water with relatively abundant dissolved CO 2 (generated by plant respiration and organic matter decay) to the groundwater [55].
As these results indicate, the hydrochemical features of the groundwater are likely influenced by many processes, including rock weathering and lixiviation processes, fertilization, plant uptake, and surface water-groundwater interactions.

PCA Results
As Table 2 shows, the top four PCs individually explain 61.5%, 14.4%, 11.5% and 7.5% of the total variance and collectively explain 94.8% of the variance.Thus, PCA has successfully reduced the data dimension from 8 to 4. As shown by and HCO 3 − in groundwater reflect the influence of surface water.In addition, previous studies [56] have found that the SO 4 2− and Cl − contents in surface water were significantly lower than those in the nearby groundwater.Thus, PC2 can indicate the influence of surface water and represent the surface water and groundwater interaction.PC3 features a high positive loading in HCO 3 − and a high negative loading in NO 3 − .It may thus reflect the relative importance of natural and anthropogenic impacts on groundwater chemistry.In an agricultural area such as the middle HRB, fertilization can contribute a significant amount of NO 3 − to shallow groundwater through surface water infiltration and groundwater recharge [14].Therefore, a high PC3 score indicates that the groundwater has not been significantly impacted by fertilization.PC4 has a high positive loading in Na + and high negative loadings in other three cations: Ca 2+ , K + and Mg 2+ .Therefore, it may represent the cation exchange process of clay particles, which is common in arid and semi-arid environments [53,54,57].

Clustering Results
Candidate models with different K values and attribute sets were examined, and Table 3 shows the likelihood and BIC values associated with the twelve models.Models 1 to 9 record different combinations of K values and PCs, and Models 10 and 11 consider Cl − , NO 3 − and HCO 3 − as its three attributes.These three ions are selected because their correlations are relatively weak (Table 1) and may represent distinctive aspects of hydrochemical processes.Model 12 considers all ions except SO 4 2− and Na + and the pH as its attributes.SO 4 2− is excluded because its concentration is linearly related to those of Cl − , Na + and Mg 2+ (Table 1).Similarly, Na + concentration is linearly related to that of Cl − .Among all of the candidate models, Model 7 has the lowest BIC value (195.00) and is therefore considered to be the best model.this model, PC1, PC2 and PC3 form the best set of attributes, and the appropriate number of cluster is 6.It is clear that the PCA effectively reduces the data dimensions from eight to three and significantly improves the overall clustering performance.These three PCs can explain over 90% of the total variance.In [14], the first four PCs explained no more than 80% of the total variance, which may be why primal variables outperformed PCs.In addition, the result of Model 12 proves the necessity of dimension reduction before clustering.Figure 3a-c illustrate the PC scores of the first five clusters based on Model 7 of GMM clustering.Compared to Model 6 which identifies fiver clusters, Model 7 separates two samples from the second cluster to form the sixth cluster, leading to a lower (i.e., better) BIC.All the first five clusters show significant correlations between the PC scores.With only two samples, the sixth cluster demonstrates no pattern, and is therefore excluded from our discussion and not presented in Figure 3.For comparison, the results of k-means clustering are presented in Figure 3d-f.This classic approach is based on distance, usually Euclidean distance, and requires the number of clusters be pre-defined.Interestingly, when we set the number of clusters to six, the same two points forms a separate cluster.Thus, Figure 3d-f also plot five clusters.It is evident that the clusters based on k-means clustering show no clear patterns in the PC spaces.Thus, the comparison in Figure 3 clearly reflects the advantage of GMM clustering in pattern recognition (refer to Figure 1).This advantage would enable a sophisticated interpretation of the hydrochemical processes.
As was discussed before, PC1, PC2 and PC3 indicate salinity (positive relationship), the influence of surface water (positive relationship) and the impact of fertilization (negative relationship), respectively.In the PC1-PC2 space (Figure 3a), a positive (negative) relationship between PC1 and PC2 may indicate that surface water recharge enhances (reduces) groundwater salinity.In the PC2-PC3 space (Figure 3b), a negative relationship between PC2 and PC3 may suggest that groundwater is receiving fertilization-impacted surface water, which is only evidence for CLUSTER 4. In the PC1-PC3 space (Figure 3c), a negative relationship between PC1 and PC3 (most evident in CLUSTER 4) may indicate that fertilization enhances groundwater salinity.In contrast, the positive relationship between PC1 and PC3 (most evident in CLUSTER 2) probably reflects a common situation in arid areas that high salinity of irrigation water can limit agricultural development [58,59].According to our field investigation and existing studies [60], the agricultural productivity of the farmlands along the river within Gaotai County, where the samples in CLUSTER 2 are located, is indeed influenced by salinity.Table 4 summarizes the main features of the five clusters demonstrated in Figure 3a-c.The groundwater in CLUSTER 1 has a moderate level of salinity (indicated by PC1), which can be enhanced by surface water recharge (indicated by PC2 and the PC1-PC2 relationship).Although salinity is influenced by fertilization (indicated by PC3), its effect is not through the recharge from surface water (indicated by PC2-PC3 relationship) but is probably through lateral groundwater flow.In CLUSTER 1, groundwater salinity may limit agricultural development (as indicated by the PC1-PC3 relationship).Compared to CLUSTER 1, CLUSTER 2 has a much wider range in salinity (as indicated by PC1), but the variation in salinity is not due to the recharge (indicated by PC1-PC2 relationship).The impact of fertilization is weaker in CLUSTER 2 than it is in CLUSTER 1 (as indicated by PC3).CLUSTER 3 has the lowest salinity (as indicated by PC1).Although surface water does influence groundwater chemistry (as indicated by PC2), it does not significantly change groundwater salinity (as indicated by the PC1-PC2 relationship).There is also no evident connection between groundwater quality and agricultural development in this case (as indicated by the PC1-PC3 relationship).As in CLUSTER 1, although the salinity in CLUSTER 3 is influenced by fertilization (as indicated by PC3), this effect is likely due to horizontal groundwater flow instead of surface water recharge.For CLUSTER 4, a distinctive characteristic is the fertilization-impact recharge (as indicated by PC3), which enhances groundwater salinity (as indicated by PC1 and the PC1-PC2 relationship) and degrades the quality of the groundwater (as indicated by the PC1-PC3 relationship).CLUSTER 5 has the highest salinity level of all five clusters (as indicated by PC1), which can be reduced by recharge (as indicated by the PC1-PC2 relationship), probably because the recharge water has a much lower salinity than the groundwater.The chemistry of the groundwater is also influenced by Table 4 summarizes the main features of the five clusters demonstrated in Figure 3a-c.The groundwater in CLUSTER 1 has a moderate level of salinity (indicated by PC1), which can be enhanced by surface water recharge (indicated by PC2 and the PC1-PC2 relationship).Although salinity is influenced by fertilization (indicated by PC3), its effect is not through the recharge from surface water (indicated by PC2-PC3 relationship) but is probably through lateral groundwater flow.In CLUSTER 1, groundwater salinity may limit agricultural development (as indicated by the PC1-PC3 relationship).Compared to CLUSTER 1, CLUSTER 2 has a much wider range in salinity (as indicated by PC1), but the variation in salinity is not due to the recharge (indicated by PC1-PC2 relationship).The impact of fertilization is weaker in CLUSTER 2 than it is in CLUSTER 1 (as indicated by PC3).CLUSTER 3 has the lowest salinity (as indicated by PC1).Although surface water does influence groundwater chemistry (as indicated by PC2), it does not significantly change groundwater salinity (as indicated by the PC1-PC2 relationship).There is also no evident connection between groundwater quality and agricultural development in this case (as indicated by the PC1-PC3 relationship).As in CLUSTER 1, although the salinity in CLUSTER 3 is influenced by fertilization (as indicated by PC3), this effect is likely due to horizontal groundwater flow instead of surface water recharge.For CLUSTER 4, a distinctive characteristic is the fertilization-impact recharge (as indicated by PC3), which enhances groundwater salinity (as indicated by PC1 and the PC1-PC2 relationship) and degrades the quality of the groundwater (as indicated by the PC1-PC3 relationship).CLUSTER 5 has the highest salinity level of all five clusters (as indicated by PC1), which can be reduced by recharge (as indicated by the PC1-PC2 relationship), probably because the recharge water has a much lower salinity than the groundwater.The chemistry of the groundwater is also influenced by fertilization (as indicated by PC3), which has a (indicated by PC2 and PC2-PC3 relationship) but instead from horizontal groundwater flow degrading effect (as indicated by the PC1-PC3 relationship).However, this influence may be not from recharge that brings nitrogen from further upstream in the groundwater flow field.A Piper diagram was also created for all groundwater samples (Figure 4).The fact that the five clusters are much better separated in the anion space implies that anions, rather than cations, determine the spatial variation of groundwater chemistry in this area.Both CLUSTER 3 and CLUSTER 1 have low salinities and relatively high HCO a common natural salinization process that occurs from upstream to downstream in semi-arid basins [33,39], whereas CLUSTER 4 mainly reflects the impact of agriculture.
Water 2017, 9, 723 12 of 19 fertilization (as indicated by PC3), which has a (indicated by PC2 and PC2-PC3 relationship) but instead from horizontal groundwater flow degrading effect (as indicated by the PC1-PC3 relationship).However, this influence may be not from recharge that brings nitrogen from further upstream in the groundwater flow field.Note: * "Limiting" refers to the effect that groundwater salinity limits the agricultural development."Degrading" indicates the effect that fertilization enhances groundwater salinity.
A Piper diagram was also created for all groundwater samples (Figure 4).The fact that the five clusters are much better separated in the anion space implies that anions, rather than cations, determine the spatial variation of groundwater chemistry in this area.Both CLUSTER 3 and CLUSTER 1 have low salinities and relatively high HCO3 -contents.In contrast, CLUSTER 5 has very high salinity dominated by SO4 2− and Cl − .CLUSTER 2 and CLUSTER 4 are between the extremes and are not well partitioned in the anion space because NO3 − was not included in the Piper diagram.The transition of CLUSTER 3 → CLUSTER 1 → CLUSTER 2 → CLUSTER 5 reflects a common natural salinization process that occurs from upstream to downstream in semi-arid basins [33,39], whereas CLUSTER 4 mainly reflects the impact of agriculture.Figure 5 illustrates the spatial and probabilistic details of the clustering results.In Figure 5a-e, the size of each circle indicates the probability that a sample belongs to a given cluster, whereas in Figure 5f, the size of the circle indicates its clustering uncertainty (Equation ( 6)). Figure 6 illustrates the regional groundwater flow field simulated by an integrated surface water-groundwater model [35].According to Figure 5, the five clusters have different high-probability locations (HPLs).The HPLs of CLUSTER 1 (Figure 5a) and CLUSTER 4 (Figure 5d) are mixed in Ganzhou District and Linze County, which have intensive agriculture (refer to Figure 2).However, the hydrochemical processes differ significantly between the clusters (see Table 4), as discussed above.The HPLs of CLUSTER 2 Figure 5 illustrates the spatial and probabilistic details of the clustering results.In Figure 5a-e, the size of each circle indicates the probability that a sample belongs to a given cluster, whereas in Figure 5f, the size of the circle indicates its clustering uncertainty (Equation ( 6)). Figure 6 illustrates the regional groundwater flow field simulated by an integrated surface water-groundwater model [35].
According to Figure 5, the five clusters have different high-probability locations (HPLs).The HPLs of CLUSTER 1 (Figure 5a) and CLUSTER 4 (Figure 5d) are mixed in Ganzhou District and Linze County, which have intensive agriculture (refer to Figure 2).However, the hydrochemical processes differ significantly between the clusters (see Table 4), as discussed above.The HPLs of CLUSTER 2 (Figure 5b) are mainly located in Gaotai County (Figure 2), which is downstream of the HPLs of CLUSTER 1 and CLUSTER 4, in which farmlands are confined to a narrow strip along the river.The HPLs of CLUSTER 3 (Figure 5c) are distributed in the areas upstream of the regional groundwater flow field (Figure 7).The HPLs of CLUSTER 5 (Figure 5e) are mainly located in Jinta Basin, which is in the most downstream part of the groundwater flow field (Figure 7).These spatial patterns reinforce our previous argument that the transition of CLUSTER 3 → CLUSTER 1 → CLUSTER 2 → CLUSTER 5 reflects the common natural salinization process of groundwater moving from upstream to downstream in semi-arid basins, whereas the data of CLUSTER 4 mainly reflect the impact of agriculture.
Water 2017, 9, 723 13 of 19 (Figure 5b) are mainly located in Gaotai County (Figure 2), which is downstream of the HPLs of CLUSTER 1 and CLUSTER 4, in which farmlands are confined to a narrow strip along the river.The HPLs of CLUSTER 3 (Figure 5c) are distributed in the areas upstream of the regional groundwater flow field (Figure 7).The HPLs of CLUSTER 5 (Figure 5e) are mainly located in Jinta Basin, which is in the most downstream part of the groundwater flow field (Figure 7).These spatial patterns reinforce our previous argument that the transition of CLUSTER 3 → CLUSTER 1 → CLUSTER 2 → CLUSTER 5 reflects the common natural salinization process of groundwater moving from upstream to downstream in semi-arid basins, whereas the data of CLUSTER 4 mainly reflect the impact of agriculture.

Regionalization of Groundwater Chemistry
Clustering has been widely applied in regionalization studies [6,9,61].Conversely, reasonable regionalization results can also validate the clustering results [25].Figure 6 demonstrates that, in this study, GMM clustering has resulted in a much more interpretable spatial pattern than k-means clustering.For example, in Figure 6a, the samples in CLUSTER 4 are concentrated in the productive farmlands, and the samples in CLUSTER 5 are mainly located in the Jinta and Jiuquan Basins.In contrast, based on k-means clustering (Figure 6b), all the five clusters have scattered samples.
Five sub-regions were further identified, as illustrated in Figure 7.In Sub-region I, CLUSTER 1 and CLUSTER 4 are the major groundwater types.Within these areas, the clustering uncertainty is

Regionalization of Groundwater Chemistry
Clustering has been widely applied in regionalization studies [6,9,61].Conversely, reasonable regionalization results can also validate the clustering results [25].Figure 6 demonstrates that, in this study, GMM clustering has resulted in a much more interpretable spatial pattern than k-means clustering.For example, in Figure 6a, the samples in CLUSTER 4 are concentrated in the productive farmlands, and the samples in CLUSTER 5 are mainly located in the Jinta and Jiuquan Basins.In contrast, based on k-means clustering (Figure 6b), all the five clusters have scattered samples.
Five sub-regions were further identified, as illustrated in Figure 7.In Sub-region I, CLUSTER 1 and CLUSTER 4 are the major groundwater types.Within these areas, the clustering uncertainty is

Regionalization of Groundwater Chemistry
Clustering has been widely applied in regionalization studies [6,9,61].Conversely, reasonable regionalization results can also validate the clustering results [25].Figure 6 demonstrates that, in this study, GMM clustering has resulted in a much more interpretable spatial pattern than k-means clustering.For example, in Figure 6a, the samples in CLUSTER 4 are concentrated in the productive farmlands, and the samples in CLUSTER 5 are mainly located in the Jinta and Jiuquan Basins.In contrast, based on k-means clustering (Figure 6b), all the five clusters have scattered samples.
Five sub-regions were further identified, as illustrated in Figure 7.In Sub-region I, CLUSTER 1 and CLUSTER 4 are the major groundwater types.Within these areas, the clustering uncertainty is relatively high (Figure 5f), which implies that this sub-region features a series of highly complicated hydrochemical processes.Sub-region I has very productive farmlands with intensive fertilization, and the surface water-groundwater exchanges in this sub-region are strong in both directions and have been significantly influenced by human activities such as diversion, pumping and irrigation [35].Additionally, both the surface and subsurface water-rock interactions in Sub-region I are active [10,32,33].Sub-region II is downstream of Sub-region I.The dominance of CLUSTER 2 in this sub-region (Figure 7) implies that the mixing of surface water and groundwater does not significantly affect groundwater salinity in this area.In fact, the strong water-rock interaction in this area causes salinity to gradually increase from upstream to downstream [56].In contrast to Sub-region I, this sub-region features a relatively small area of farmland, which is mainly distributed in a narrow strip along the river (Figure 5b); therefore, the anthropogenic effect is not significant.
Sub-region III is mainly located in the Jinta Basin, and is dominated by CLUSTER 2 to CLUSTER 5, the two downstream groundwater types.Both clusters are not significantly impacted by surface water, as is indicated by Table 4, which is consistent with the fact that this sub-region has low precipitation, high evapotranspiration, and a relatively weak hydraulic connection with the main Heihe River (Figure 7).Sub-region IV represents the upstream part of the groundwater flow field, where groundwater salinity is low.The groundwater depth in this region is very deep (200-300 m); therefore, agricultural activities can only weakly influence the groundwater quality.Sub-region V occupies the northwest region of the study area, and mainly includes Jiuquan Basin.It has a mixture of five clusters and is a relatively isolated area with a weak hydraulic connection with the main Heihe River.
Figure 7 shows that GMM clustering produces satisfactory regionalization results, which in turn validate the clustering analysis.Although many regionalization methods, such as the approaches based on the Tyson triangle and Kriging interpolation, rely on the geographical locations of observations, the GMM clustering approach requires no location information.Clearly, GMM clustering can effectively identify connections between observations based on underlying physical processes, rather than geographical distance.Compared to the work [14], the clusters of samples in our study are more spatially assembled, which helps develop an integrated understanding of the regional hydrochemical processes.

Impact of Regional Water Cycle
To further validate these clustering results, the Spearman's rank correlation [51] is examined between the membership probabilities calculated by Equation ( 4) and a series of key hydrological variables representing climatic, topographical and agricultural factors.Here, we further define m k = [m 1k , m 2k , • • • m nk ] T as the membership vector of the kth cluster.Hydrological variables have either been observed (e.g., rainfall, surface water diversion for irrigation) or modeled (e.g., evapotranspiration and infiltration rates) by [35].This statistical analysis considers the annual average values (from 2000 to 2012) of these variables within the 9-km 2 domain in which each sampling site lies (i.e., a 3 × 3 modeling grid with the site in the central square).The rationale behind considering temporally averaged values is that groundwater chemistry undergoes change relatively slowly and thus largely reflects the long-term impacts of hydrological processes [61].The rationale to perform spatial averaging is that groundwater chemistry may reflect not only the hydrological processes at a specific site but also those in adjacent areas.
Table 5 presents the calculated correlation coefficients of this analysis.These results are consistent with previously discussed findings.First, precipitation is positively correlated with m 1 , m 3 and m 4 but has no significant correlation with m 2 and m 5 .Given that Table 4 shows, CLUSTERs 1, 3 and 4 have a common feature that the influence of surface water is at least at a medium level, this correlation with precipitation likely reflects the influence of surface water on groundwater chemistry.Second, in addition to recording a positive correlation with precipitation, m 3 demonstrates a negative correlation with potential ET and irrigation and a positive correlation with groundwater depth.As discussed previously, CLUSTER 3 is generally distributed within the upstream area of the groundwater flow field, where elevation and groundwater depth are relatively high and temperature is relatively low.This explains the positive correlation with groundwater depth and the negative correlation with potential ET.Additionally, CLUSTER 3 mainly reflects the upstream stage of the common natural salinization process in semi-arid basins and has little agricultural impact (Table 4), which explains the observed negative correlation with irrigation.Third, in addition to the observed positive correlation with precipitation, m 4 positively correlates with infiltration, which is consistent with the finding that CLUSTER 4 is characterized by fertilization-impact recharge (Table 4).Fourth, m 5 records a positive correlation with the two ET variables and a negative correlation with unsaturated zone (UZ) recharge.In the HPLs of CLUSTER 5, where the climate is dry and hot, the ET process draws water from the shallow aquifer and soil; the downward percolation of water is limited, which leads to increasing concentrations of chemicals in groundwater.Finally, m 2 shows no significant correlation with any of the selected variables, which is consistent with the results in Table 4 and reflects the transitional role of CLUSTER 2 in the natural salinization process of groundwater moving from upstream to downstream in semi-arid basins.It is evident that by introducing independent hydrological data, the correlation analysis further validated the clustering results.An important implication of this conclusion is that with the GMM clustering analysis conducted in this study, a single sampling campaign can provide ample information about the hydrochemical processes of groundwater.

Conclusions
This study applies GMM clustering to a hydrochemical dataset of groundwater collected in the middle Heihe River Basin (HRB) in northwestern China.To validate these clustering results, a regionalization of groundwater chemistry was attempted, and independent hydrological data were introduced to perform correlation analyses.The GMM clustering results and the regionalization results based on this clustering provide ample information about the hydrochemical processes in the study area.The main findings include the following.First, in the middle HRB, groundwater chemistry demonstrates a typical natural salinization process moving from upstream to downstream, in which this natural process is further complicated by anthropogenic factors.Second, regional hydrological processes, especially surface water-groundwater interactions, can exert a profound and spatially varied impact on groundwater chemistry.Third, the interaction between human activity (which, in this study area, is mainly defined as agricultural development) and groundwater quality is complicated.In some farmlands, groundwater quality may be significantly degraded by agricultural development, whereas in others, groundwater quality may constrain this agricultural development.
Overall, this study demonstrates that the GMM can be effectively used in clustering to address hydrochemical groundwater datasets.Additionally, GMM clustering can provide insights into hydrochemical processes, even with a limited number of observations (e.g., the data collected in a single sampling campaign, as in this study).In many remote inland areas, it is technically and financially difficult to implement long-term high-frequency groundwater monitoring, and even a single sampling campaign can be very costly.Therefore, using the clustering approach developed in this study is a cost-effective way to investigate the groundwater chemistry in such areas.Future studies may address the following important issues: (1) dealing with multiple periods of data in the GMM clustering; (2) fusing hydrochemical and hydrological data in the GMM clustering; and (3) automating the selection of the optimal model.

Figure 1 .
Figure 1.An example of the weakness of heuristic clustering methods.(a) Model-based methods such as Gaussian Mixture Model (GMM) can separate the two distinctive relationships of the two attributes, but (b) heuristic methods such as k-means clustering cannot.

Figure 1 .
Figure 1.An example of the weakness of heuristic clustering methods.(a) Model-based methods such as Gaussian Mixture Model (GMM) can separate the two distinctive relationships of the two attributes, but (b) heuristic methods such as k-means clustering cannot.

Figure 2 .
Figure 2. The study area.(a) Its location in the Heihe River Basin; (b) the groundwater sampling sites; and (c) the groundwater table and flow.

Figure 2 .
Figure 2. The study area.(a) Its location in the Heihe River Basin; (b) the groundwater sampling sites; and (c) the groundwater table and flow.

3 −
contents.In contrast, CLUSTER 5 has very high salinity dominated by SO 4 2− and Cl − .CLUSTER 2 and CLUSTER 4 are between the extremes and are not well partitioned in the anion space because NO 3 − was not included in the Piper diagram.The transition of CLUSTER 3 → CLUSTER 1 → CLUSTER 2 → CLUSTER 5 reflects

Figure 4 .
Figure 4.The Piper diagram for the groundwater samples.

Figure 4 .
Figure 4.The Piper diagram for the groundwater samples.

Figure 5 .
Figure 5. Spatial patterns of the membership probabilities and the clustering uncertainty.(a-e) the probabilities of the samples belonging to CLUSTERs 1 to 5, respectively, and (f) the clustering uncertainty.

Figure 5 .
Figure 5. Spatial patterns of the membership probabilities and the clustering uncertainty.(a-e) the probabilities of the samples belonging to CLUSTERs 1 to 5, respectively, and (f) the clustering uncertainty.

Figure 6 .
Figure 6.Spatial distributions of the clustered samples based on (a) GMM clustering; and (b) k-means clustering.

Figure 7 .
Figure 7. Regionalization of the groundwater chemistry.Based on the GMM clustering, the study area can be divided into five sub-regions.

Figure 6 .
Figure 6.Spatial distributions of the clustered samples based on (a) GMM clustering; and (b) kmeans clustering.

Figure 6 .
Figure 6.Spatial distributions of the clustered samples based on (a) GMM clustering; and (b) k-means clustering.

Figure 7 .
Figure 7. Regionalization of the groundwater chemistry.Based on the GMM clustering, the study area can be divided into five sub-regions.

Figure 7 .
Figure 7. Regionalization of the groundwater chemistry.Based on the GMM clustering, the study area can be divided into five sub-regions.

Table 1 .
Correlation coefficient matrix of eight ions and TDS (log-transformed concentrations).
Note: Coefficient values over 0.9 are in bold.The underlined numbers indicate insignificant correlations (p-value > 0.05).

Table 2 ,
PC1 has high positive loadings in Na + , K + , Mg 2+ , Ca 2+ , Cl − and SO 4 2− but very low loadings in HCO 3 − and NO 3 − .Additionally, PC1 has a strong linear relationship with TDS (R 2 = 0.956).Therefore, PC1 is likely an indicator of salinity and can be used to represent rock-groundwater interactions.PC2 has very high positive loadings in NO 3 − and HCO 3 − and negative loadings in SO 4 2− , Na + and Cl − .As previously mentioned, NO 3

Table 2 .
PCA loadings in different ions and variances explained by different PCs.

Table 3 .
Twelve candidate models and their associated likelihoods and BIC values.

Table 4 .
Main characteristics of the five clusters with regard to salinity level and impacts of irrigation and fertilization.
Note: * "Limiting" refers to the effect that groundwater salinity limits the agricultural development."Degrading" indicates the effect that fertilization enhances groundwater salinity.

Table 4 .
Main characteristics of the five clusters with regard to salinity level and impacts of irrigation and fertilization.

Table 5 .
Spearman's rank correlation coefficients between hydrological variables and membership probabilities *.Note: * Coefficients with absolute values over 0.23 indicate significant correlation at the 95% confidence level and are bolded and underlined in the table.