A Simple Approach of Groundwater Quality Analysis, Classiﬁcation, and Mapping in Peshawar, Pakistan

: Groundwater is an important source of water for drinking, agriculture, and other household purposes, but high population growth, industrialization, and lack of oversight on environmental policies and implementation have not only degraded the quality but also stressed the quantity of this precious source of water. Many options existed, but this study evaluated, classiﬁed, and mapped the quality of groundwater used for potable consumption with a simple approach in an urban area (Peshawar valley) of Pakistan. More than 100 groundwater samples were collected and analyzed for physio-chemical parameters in a laboratory. Hierarchal clustering analysis (HCA) and classiﬁcation and regression tree (CART) analysis were sequentially applied to produce potential clusters / groups (groundwater quality classes), extract the threshold values of the clusters, classify and map the groundwater quality data into meaningful classes, and identify the most critical parameters in the classiﬁcation. The HCA produced six distinct potential clusters. We found a high correlation of electrical conductivity with total hardness ( R 2 = 0.72), alkalinity ( R 2 = 0.59) and chloride (cid:16) R 2 = 0.64 (cid:17) , and, total hardness with chloride ( R 2 = 0.62), and alkalinity ( R 2 = 0.51). The CART analysis conclusively identiﬁed the threshold values of the six classes and showed that total hardness was the most critical parameter in the classiﬁcation. The majority of the groundwater was either with worse quality or good quality, and only a few areas had the worst groundwater quality. This study presents a simple tool for the classiﬁcation of groundwater quality based on several aesthetic constituents and can assist decision makers develop and support policies and / or regulations to manage groundwater resources.


Introduction
Groundwater plays a vital role in the water cycle, preserves a healthy ecosystem balance [1], and acts as one of the primary sources of drinking water [2], irrigation [3], and inflow to many streams or rivers [4,5]. However, uncontrolled and unregulated exploitation of groundwater [6], high population growth [7][8][9], rapid urbanization [10], expanding agriculture [11], and increased industrial development [12] is deteriorating this resource. Approximately 35% of the global population faces conductivity (EC; µS/cm), total dissolved solids (TDS; mg/L), bicarbonate alkalinity (HCO 3 ; mg/L as CaCO 3 ), total hardness (mg/L as CaCO 3 ), calcium hardness (Ca; mg/L as CaCO 3 ), magnesium hardness (Mg; mg/L as CaCO 3 ), turbidity (Nephelometric Turbidity Unit/NTU), nitrate (mg/L), and chloride (mg/L) [41] in the Peshawar, which is the capital and most populated district of the province, utilize them in the clustering analysis, and classify the quality of groundwater into different and useful classes. We hypothesize that a useful tool for identifying groundwater quality groups can be developed from HCA and CART using easily obtained groundwater parameters. This would help the water-related agencies to act timely and efficiently in the worst groundwater quality areas as well as could be useful for planning a more detailed field classification when developing new groundwater supplies.

Study Area
Peshawar (34 • 00 56.2" N, 71 • 34 39.8" E) is the most populated district and capital of the Khyber Pakhtunkhwa province in Pakistan. It spreads over an area of 1257 km 2 [42] and has been divided into four towns (Town 1, Town 2, Town 3, Town 4) and a cantonment area which occupy 3%, 16%, 35%, 45%, and 1% of the total area, respectively [43]. These four towns are further subdivided into 83 union councils ( Figure 1). The total population of the area is 4.269 million, wherein 2.299 million are living in rural areas and 1.970 million in urban areas [44]. The study area's mean annual minimum and maximum temperatures are 1.0 • C and 42.3 • C [45], and the mean annual rainfall is 545.9 mm [46]. District Peshawar is the part of the large Peshawar basin, which has broad and oval-shaped depressions of fluvial, deltaic, and lacustrine sediments. The sediments are mainly gravel, sand, silt, and clay and form a moderate thick and productive aquifer in the north and center, approximately 50 to 100 m 3 /h yield while in the southern parts, the thickness of the aquifer is less and the yield is approximately 10-50 m 3 /h ( Figure 2). These aquifers are mainly recharged by precipitation, surface water reservoirs, rivers, and irrigation network [47,48]. The underground rocks in the adjacent regions (eastern and southern) are limestones, shale, or calcareous quartzite [49]. Depth to groundwater table varies from 5 to more than 50 m from north-east to south-west [33].

Groundwater Sampling and Determination of Physio-Chemical Properties
A field survey was conducted in June 2012 to collect 105 groundwater samples from public drinking places, such as tube wells, dug wells, and hand pumps. These samples were collected in 250 mL plastic bottles from areas where depth to groundwater table was less than 5 m in the north while more than 50 m in the south-west. Before sampling, the bottles were thoroughly washed with distilled water to avoid the contamination. At least one ground sample was collected from each union council. However, few union councils, such as Adezai, Aza khel, Mashogagar, Mian Gujjar, Khatiki, and Joganaie, were inaccessible. Location (latitude and longitude) of each groundwater sample was recorded using a handheld global positioning system (GPS).
HACH Multimeter (Model SENSION 156) was used to measure pH, EC, and TDS of all 105 groundwater samples. However, before measurement, the Multimeter was calibrated with three buffer solutions (a solution with constant H) such as 4, 7, and 10 to ensure the accurate readings. Turbidity was measured using HACH Turbidity meter (2100 N), Ca and Mg hardness, total hardness and chloride with a titration method, while the nitrate concentration was determined through spectrophotometer (Thermo Spectronic, Genesys 5) [41].

The Clustering Analysis
Two types of clustering analysis, namely, hierarchal clustering analysis (HCA) and classification and regression tree (CART), were sequentially applied on the groundwater quality data to produce the potential clusters (groundwater quality classes) as well as to identify the most critical water quality parameters and their exact threshold values for the classification and mapping of groundwater quality. HCA is an unsupervised classification method which merges (agglomerative method) or splits (divisive method) the individual observations in a large group based on similar or dissimilar proximity measure, i.e., Euclidian distance (Equation (1)) [50].
where, is the Euclidian distance between observation , and are the values of the m th variable.
In both cases, the HCA produces number of clusters. Therefore, it is important to predetermine the optimal number of clusters ( ) [51]. The with-in cluster variation (sum of square) for 1-15 clusters was calculated to decide the as it provides excellent results with HCA [52]. The data was normalized (range-equalization) prior to Euclidian distance calculation because the parameters were in different units and treating them, in the same way, would produce uncertain results because one parameter would have more influence over other parameters [53]; thus, each value was normalized to a scale of 0 to 1 (Equation (2)), according to their minimum and maximum values ( , ) ( Table 1).

The Clustering Analysis
Two types of clustering analysis, namely, hierarchal clustering analysis (HCA) and classification and regression tree (CART), were sequentially applied on the groundwater quality data to produce the potential clusters (groundwater quality classes) as well as to identify the most critical water quality parameters and their exact threshold values for the classification and mapping of groundwater quality. HCA is an unsupervised classification method which merges (agglomerative method) or splits (divisive method) the individual observations in a large group based on similar or dissimilar proximity measure, i.e., Euclidian distance (Equation (1)) [50].
where, d kl is the Euclidian distance between observation k and l, X km and X lm are the values of the m th variable. In both cases, the HCA produces n number of clusters. Therefore, it is important to predetermine the optimal number of clusters (OC) [51]. The with-in cluster variation (sum of square) for 1-15 clusters was calculated to decide the OC as it provides excellent results with HCA [52].
The data was normalized (range-equalization) prior to Euclidian distance calculation because the parameters were in different units and treating them, in the same way, would produce uncertain results because one parameter would have more influence over other parameters [53]; thus, each value X was normalized to a scale of 0 to 1 (Equation (2)), according to their minimum and maximum values (X min , X max ) ( Table 1).
After the normalization, function hclust included in package fastcluster [54] was applied, and the potential clusters (OC) were produced.
Since we were interested in identifying the most important water quality parameter and their exact threshold values for classification and mapping, CART analysis was used, and function rpart of the package Recursive Partitioning and Regression Trees (rpart; [40]) was applied. CART is a supervised classification method where the water quality parameters were used as explanatory variables, and the response variable was the potential clusters produced from HCA. CART uses all explanatory variables (pH, EC, TDS, alkalinity, total hardness, Ca, Mg, turbidity, nitrate, and chloride), calculates and ranks the impurity (heterogeneity) before and after each possible split (midpoint) for each explanatory variable, uses the value of that explanatory variable, which greatly reduces the heterogeneity, and divides the data into two homogeneous groups [55]. This recursive process is then applied on either side, and, at the end, a large tree is produced, which contains the exact threshold values of the explanatory variables at each split (node).

Classification and Mapping of the Groundwater Pollution
The threshold values of the water quality parameters (explanatory variables) at each node of the classification tree produced from CART analysis were used to extract and separate the whole data into those optimum clusters (OC; groundwater quality classes). The data were then imported to ArcGIS (version 10.4), and the union councils and water quality data were joined based on spatial location. Joining data based on spatial location has the advantage that inaccessible union councils with no groundwater sample get the ranking of the nearest sampling point, but the categorization of such inaccessible union councils can be considered as broad. This way, we used the identified classes and developed a thematic map of the groundwater quality; however, we assigned a high rank to those union councils that contained two or more groundwater samples with different groundwater quality classes. Table 1 shows a summary of the physio-chemical properties of the groundwater samples in the study area. The minimum, mean, and maximum values of the pH However, the later five parameters (total hardness, turbidity, TDS, Ca, and Mg hardness) contained groundwater samples with values higher than the drinking water quality standards. The mean values of the EC (809.44 µS/cm), which were much higher than the standard values (300-500 µS/cm), indicate salt enrichment in the study area [56]. The high level of such contaminants could be attributed to the domestic wastewater contamination from chemicals and woolen mills, soap, and marble industries [57] or the underground geology of the adjacent regions in which limestone/calcareous quartzite are the most common [49,58]. The results show that groundwater is safe for drinking in terms of nitrate concentration, which is the most important parameter in water quality assessment because its high concentration in drinking water affects bottle-fed infants, for example, cyanosis (discoloration of the skin) and methemoglobinemia [18].

Clustering Analysis of the Groundwater Data
Many researchers have used the DRASTIC [29] or GIS-based DRASTIC models (overlay and indexed method) [30][31][32][33], MOUSE and RUSTIC models (process based simulation methods) [36,37] for water quality assessment while other have applied the clustering analysis (statistical methods) [32,[60][61][62] in which they simply separated different clusters/groups based on similarity or dissimilarity. The clustering analysis is also used in this study, but the interest was to find out the exact threshold values that separate those clusters, and this makes it different than the other studies. Two types of clustering analysis such as HCA: an unsupervised statistical method which uses similarity, dissimilarity, or distance metrics and classify a large group of observations into several clusters [63][64][65] and CART: a statistical technique, which is used to select the most important parameters and their interactions in the determination of an outcome [40] were sequentially applied with a simple approach. First, HCA was used to separate the potential clusters (groundwater quality classes), and then, the CART analysis was used to extract the exact threshold values of those potential clusters for the classification and mapping of groundwater quality and identification of the most important parameters in the disintegration of clusters [51]. In HCA, it was challenging to decide the optimum number of clusters; however, the within-cluster variation (sum of square), which produces satisfactory results with hierarchal clustering [52], was used and calculated from 1 to 15 clusters (Figure 3). A high variation (within groups sum of square) in the water quality existed from cluster 1 to 6, but from cluster 6 onwards, the variation was small. Thus, we decided to use an optimum number of clusters OC = 6.
Then, the HCA [54] was applied on water quality data, which included all the water quality parameters -pH, EC, TDS, HCO 3 , total hardness, Ca, Mg, turbidity, nitrate, and chlorideand six potential clusters/groups were produced, wherein, some groups were clearly distinguishable on the scatterplot while others had some degree of overlaps (Figure 4). Only the most correlated parameters (EC, alkalinity, and chlorides) are presented on the scatterplot as it was expected that these parameters would play an important role in the disintegration of the potential clusters/groups (see below). A high correlation of EC with chloride (R 2 = 0.64) and alkalinity (R 2 = 0.59), and total hardness with alkalinity (R 2 = 0.62) ( Figure 5) was found. Thus, EC and total hardness could be the key parameters in the water quality assessment [17]. A similar high relationship was also found by [66] during the physio-chemical analysis of groundwater in Chithar Riber Basin, India. The high correlation of the EC with chloride and alkalinity also shows that the salinity in the study area's groundwater is controlled by chlorides and bicarbonates [60]. This could probably be derived from weathering of rock salts and gypsum-bearing aquifers [67]. Also, the sewage water, which contains untreated effluents from small industries, such as marble, soap, and food processing, is used for agriculture purposes in the study area, and it contaminates the crops as well as the groundwater [68]. Pakistan council of research in water resources [69] has described that the groundwater of Peshawar is dominated by Mg and Ca cations as well as chloride and bicarbonate anions; however, our results showed the most of them are within the permissible limits. Then, the HCA [54] was applied on water quality data, which included all the water quality parameters -, , , , ℎ , , , , , and ℎ -and six potential clusters/groups were produced, wherein, some groups were clearly distinguishable on the scatterplot while others had some degree of overlaps (Figure 4). Only the most correlated parameters ( , , and ℎ ) are presented on the scatterplot as it was expected that these parameters would play an important role in the disintegration of the potential clusters/groups (see below). A high correlation of with ℎ (R 2 = 0.64) and (R 2 = 0.59), and ℎ with (R 2 = 0.62) ( Figure 5) was found. Thus, and ℎ could be the key parameters in the water quality assessment [17]. A similar high relationship was also found by [66] during the physio-chemical analysis of groundwater in Chithar Riber Basin, India. The high correlation of the with ℎ and also shows that the salinity in the study area's groundwater is controlled by chlorides and bicarbonates [60]. This could probably be derived from weathering of rock salts and gypsum-bearing aquifers [67]. Also, the sewage water, which contains untreated effluents from small industries, such as marble, soap, and food processing, is used for agriculture purposes in the study area, and it contaminates the crops as well as the groundwater [68]. Pakistan council of research in water resources [69] has described that the groundwater of Peshawar is dominated by and cations as well as chloride and bicarbonate anions; however, our results showed the most of them are within the permissible limits.     Figure 6. Classification tree as a result of CART analysis. Each node/split has a threshold value of the water quality parameter, which was used by CART analysis to split the data into homogenous clusters/groups (groundwater quality classes). 1: Worst quality, 2: Worse quality, 3: Bad quality, 4: Good quality, 5: Better quality, and 6: Best quality. After the HCA, the exact threshold values of the groundwater quality parameters, which separated the clusters, were extracted, and the classification tree, which was produced as a result of the CART analysis, is shown in Figure 6. As expected, the first split/node was based on total hardness ≥ 265 mg/L as CaCO 3 , which separated the whole data into two groups. The next splits were based on EC < 1432 µS/cm, chloride ≥ 35 mg/L, and EC ≥ 939 µS/cm. Moreover, on the other side of the classification tree, the split was based on nitrate ≥ 18 mg/L. CART analysis produced six potential groups/clusters, identified the threshold values of water quality parameters and showed that total hardness was the most important parameter in the disintegration of the potential clusters/groups as the first split was based on the total hardness [40]. CART analysis performed smart decisions and included the high correlated parameters, which were expected to be important in the classification of groundwater quality into useful classes.

Classification and Mapping of the Groundwater Pollution
The threshold values of the water quality parameter at each node/split of the classification tree as identified by CART analysis (Figure 6) were used to divided the whole data into six classes and were named according to the level of concentration of the water quality parameters, such as worst quality (#1), worse quality (#2), bad quality (#3), good quality (#4), better quality (#5), and best quality (#6). Figure 7 shows the mean values of the groundwater quality parameters of each groundwater quality class and their standard deviations, wherein, the values of the parameters, particularly, the EC, alkalinity, and chloride, are clearly decreasing from the worst quality class (#1) to the best quality class (#6) (Figure 7a). The thematic map of the final classes of the groundwater quality in the study area ( Figure 8) shows that the worst (#1) and worse (#2) groundwater quality areas (19 union councils) are located in the north, center, south, and south-eastern parts of the study area, and the bad quality (#3) areas are Urmar Miana and Chaghar Matti located at the north and south, respectively. The low groundwater quality at the north and north-eastern parts (worst (#1) and worse (#2) quality areas) are probably due to the fertilizers and pesticides applications in the agricultural fields or domestic and industrial wastewater, which drains into the river in these areas [59] coupled with the lowest groundwater table (less than 1.5 m), sandy soil type, and gravel and sand aquifer material where the contaminants require the shortest travel time to reach the groundwater and move easily from one place to another [33]. Most of the area is under good groundwater quality (#4) (eastern, central, south-eastern areas and some parts of the south and north-east) where the water table is more than 50 m deep [70] and the soil type is silt-loam which helps to restrict or minimize the movement of the contaminants [33]. Three union councils such as Hayatabad, Malakandir, and Bazid Khel have better groundwater quality (#5), and only the Shahi Bala union council has the best groundwater quality (#6). This categorization can be considered as broad in the union councils such as Sher Kera, Aza Khel, and Shekhan (south-west), and Khatki, Mian Gujjar, Lala, and Jhagra (north-east) where the data was not available, and a detailed water quality analysis will be recommended before planning a water supply scheme. The thematic map produced as a result of the clustering analysis has a similar spatial distribution of the groundwater quality as the final vulnerability map in [33], in which they have assessed the groundwater vulnerability to pollution using a completely different method (GIS-based DRASTIC model). It is also interesting to compare the thematic map of groundwater quality classes (Figure 8) with the spatial distribution maps of different water quality parameters in the study area [12]. This method was able to identify and separate all areas with a high concentration of the water quality parameters as worse or worst groundwater quality class. The groundwater quality spatial distribution could further be improved if additional water quality parameters are included in the analysis; however, the field survey and laboratory analysis are the most tedious job and depend on the time and cost involved [71]. It would also be more interesting to perform a spatiotemporal analysis of the groundwater quality in the study area to observe a periodical change in the groundwater quality and validate the performance of the method used. We believe that the clustering analysis used in this study would be a simple and robust way for water-related agencies to classify, map, and prioritize groundwater quality.  Figure 6. Classification tree as a result of CART analysis. Each node/split has a threshold value of the water quality parameter, which was used by CART analysis to split the data into homogenous clusters/groups (groundwater quality classes). 1: Worst quality, 2: Worse quality, 3: Bad quality, 4: Good quality, 5: Better quality, and 6: Best quality.

Conclusions
In this study, a useful tool that could be applied to any area was developed from HCA and CART analysis to identify different groundwater quality classes. Simple and easily obtained groundwater quality parameters, such as , , , , ℎ , , , , , and ℎ were determined and used as input in the tool. This tool could be used as a groundwater quality identifier before developing new water supplies; however, adding additional groundwater quality parameters will further improve the results. The tool can also be used for prioritizing the groundwater quality regions, and its validity can be assessed by planning a detailed spatiotemporal analysis of the groundwater quality in the same or any other region.

Conclusions
In this study, a useful tool that could be applied to any area was developed from HCA and CART analysis to identify different groundwater quality classes. Simple and easily obtained groundwater quality parameters, such as pH, EC, TDS, alkalinity, total hardness, Ca, Mg, turbidity, nitrate, and chloride were determined and used as input in the tool. This tool could be used as a groundwater quality identifier before developing new water supplies; however, adding additional groundwater quality parameters will further improve the results. The tool can also be used for prioritizing the groundwater quality regions, and its validity can be assessed by planning a detailed spatiotemporal analysis of the groundwater quality in the same or any other region.