Comparative Study of Hydrochemical Classification Based on Different Hierarchical Cluster Analysis Methods

Traditional methods for hydrochemical analyses are effective but less diversified, and are constrained to limited objects and conditions. Because their accuracy and reliability are limited, they are often used to complement, or in combination with, other methods to solve practical problems. Cluster analysis is a multivariate statistical technique that extracts useful information from complex data. It provides new ideas and approaches to hydrogeochemical analysis, especially for groundwater hydrochemical classification. Hierarchical cluster analysis is the most widely used method in cluster analysis. This study compared the advantages and disadvantages of six hierarchical cluster analysis methods and analyzed their objects, conditions, and scope of application. The six methods are single linkage, complete linkage, median linkage, centroid linkage, average linkage (including between-groups linkage and within-groups linkage), and Ward’s minimum-variance. Results showed that single linkage and complete linkage are unsuitable for complex practical conditions. Median and centroid linkages are likely to cause reversals in dendrograms. Average linkage is generally suitable for classification tasks with many samples and large data sets. However, Ward’s minimum-variance achieved better results for fewer samples and variables.


Introduction
Traditional methods for graphical analysis of hydrochemical data include Piper (trilinear) diagrams, scatter plots, quadrilateral diagrams, rhombus diagrams, triangle diagrams, Schuka Lev classification, Broski classification, Kurllov's (KypmoBa) formula, etc. [1][2][3][4][5]. Studies relying on a single one of the aforementioned methods may yield limited and biased results. For example, the classification of water samples using Piper diagrams tends to be vague and ineffective, as it only plots a few major anions and cations [6,7]. The Schuka Lev classification has clear indices (for chemical constituents in groundwater) and a subjective predetermined threshold in milliequivalents (mEq) for ions. Therefore, this method obscures the fuzziness in water quality to some extent, and the variation of water quality is not detailed enough in classification results [8][9][10][11].
In recent years, cluster analysis (CA) and other multivariate statistical methods have been increasingly used in the classification of foundations. They can effectively extract useful information from complex datasets, and provide a reasonable and efficient approach to the study of chemical characteristics of groundwater [12,13]. The main factors affecting the hydrochemical field can be effectively identified using information on the major ionic and nonionic components of groundwater extracted through multivariate statistical methods, which may further facilitate the understanding of the formation mechanism of the hydrochemical field [7][14][15][16][17][18][19]. Furthermore, clustering methods provide a comprehensive analysis of hydrochemical properties and make hydrochemical analysis more rational by revealing, to a certain extent, the sources of recharge, hydraulic relations, transport laws of groundwater, and the interaction characteristics between groundwater and its surrounding environment [20][21][22].
Moreover, CA covers many topics and is flexible. There are many theories and techniques related to CA, which may be applied to various objects and conditions. If the selected technique is unsuitable for a task, characterization of the nature and internal laws of data will be difficult, and may produce results that deviate from reality and the original intention of research. Therefore, core issues that need to be urgently addressed are: (a) Selection of one or several clustering methods for analysis under specific conditions; (b) comparing the advantages and disadvantages of various methods; (c) approximation of actual object compositions and the reflection of the objective laws of data; (d) achieving the optimal process and results through CA.
Therefore, in this study we performed a CA on 19 groups of leakage water samples collected from the Bayi Tunnel in Chongqing (municipality directly under the Central Government) to investigate the internal relationship between the sample data using six hierarchical cluster analysis (HCA) methods, i.e., single linkage, complete linkage, median linkage, centroid linkage, average linkage (including between-groups and within-groups linkage), and Ward's minimum-variance. In addition, this study compared the advantages and disadvantages of the aforementioned methods and analyzed their objects, conditions, and scope of application.

General Setting of the Study Area
The Bayi Tunnel is located between the Lianglukou Subdistrict and the Shangqingsi Subdistrict of Yuzhong District in Chongqing, Southwestern China. The entrance of the Bayi Tunnel is located in Jianxinpo, and the exit is southeast of the Chongqing Municipal Facilities Administration Bureau. The tunnel passes beneath the Chongqing Emergency Medical Center (CEMC), Chongqing Sports Bureau, and Lines 1 and 3 (Jianxinpo Tunnel) of the Chongqing Rail Transit. It was constructed in 1984 and is surrounded by roads in all directions. Daily traffic in the surrounding areas is convenient, with dense flows of people and vehicles. It is an important tunnel in the Chongqing traffic hub. However, the tunnel suffers from water leakage and other issues, partly because of its long service life, and partly because of intense human activities and complex natural conditions in its surrounding areas.
The soil in the study area is mainly composed of Quaternary gray brown clay and gray purple silty sand, mixed with gravel, with good hydraulic conductivity. The outcropping strata are fluvial and lacustrine sedimentary rocks, mainly composed of Jurassic fine sand and silty mudstone. The weathering fracture depth is generally 0.2-1.5 m. The groundwater is mainly distributed in the pores of Quaternary loose layer and weathered fissures of bedrock, which is mainly recharged by precipitation.

Sample Collections
After a rainfall event, a total of 19 water sample sets were collected: One sample set of underground sewer water (USW) from CEMC above the Bayi Tunnel; one set of precipitation (rain) samples from the atmosphere near the tunnel periphery; one sample set of the bedrock fissure water (BFW) and a set of pumping pipeline water (PPW) from superjacent Jianxinpo Tunnel; fifteen leakage water sample sets were collected from the Bayi Tunnel. Three sets of the fifteen were collected from the drain hole in the lining (at 272 m) of the Bayi Tunnel on three consecutive days. Twelve sets were collected on four consecutive days from three leakage points of the tunnel lining, at 327.5, 347, and 355 m, respectively.
Polyethylene bottles with 50-mL capacity were used as sample containers. The bottles were cleaned with distilled water before sampling and then rinsed 2 to 3 times with the water sample to be taken. Each sample set comprised two portions: a sample for cation analysis, to which dilute nitric acid (HNO₃) was added until its pH was less than 2; and a sample for anion analysis, which was unprocessed. The sampling process was in line with the relevant specifications and requirements in the Guidance of Collection and Preservation of Groundwater Sample for Quality Control (DZ/T 0064.2-93).

Chemical Analyses
HCO₃⁻ was measured in the field using a simple titration device with an analysis precision of 0.03 mmol/L (1.83 mg/L). The pH, temperature, and electrical conductivity (EC) were measured in the field using a Hanna HI8733 portable conductivity meter and a Hanna HI8242 portable pH/mV meter, with analysis precisions of 0.01 (pH), 0.1 °C (temperature), and 1 µS/cm (EC). Water samples were sent to the State Key Laboratory of Biogeology and Environmental Geology at China University of Geosciences (Wuhan) for cation and anion analyses within one week after the rainfall event. Cations were measured using inductively coupled plasma optical emission spectrometry (ICP-OES, IRIS Intrepid II XSP, Thermo Fisher Scientific, Waltham, MA, USA) with a precision of 1 × 10⁻³ mg/L, and anion analysis was performed using an ion chromatograph (IC, DX-120, Dionex, Sunnyvale, CA, USA) with a precision of 0.01 mg/L (Table 1). To explore the internal relationships between different water sample types, as well as the temporal variation within the same water sample type, the four water samples with missing value(s) were retained for CA. Because the contents of the corresponding variables were below the detection limits, the missing values were replaced with 0 in CA. The charge-balance error (CBE), calculated for each water sample as the cation-anion difference expressed as a percentage of the total ion sum, was within ±5% (Table 1). All analyses yielded analytical errors <5% and external precision of known-unknown analytical standards. To better ensure the quality of raw data, EC was also processed and compared with total dissolved solids (TDS) [23][24][25]. A simple linear regression of TDS (y) on EC (x) yielded y = 0.7117x with R² = 0.9906. All procedures of sampling, preservation, and transportation to the laboratory were strictly conducted in accordance with standard methods [26].
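The two quality checks described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the ion list, the sample concentrations, and the EC/TDS pairs are hypothetical values, and the CBE formula (cation-anion difference over the cation-anion sum, in meq/L) is the common convention and is assumed here.

```python
# Equivalent weights in mg/meq (molar mass divided by charge magnitude).
# The set of ions is an assumption for illustration.
EQ_WEIGHT = {
    "Ca": 40.078 / 2, "Mg": 24.305 / 2, "Na": 22.990, "K": 39.098,
    "HCO3": 61.017, "SO4": 96.06 / 2, "Cl": 35.453, "NO3": 62.004,
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "SO4", "Cl", "NO3")


def total_meq(sample_mg_per_l, ions):
    """Sum of the listed ions in meq/L; absent ions count as 0."""
    return sum(sample_mg_per_l.get(ion, 0.0) / EQ_WEIGHT[ion] for ion in ions)


def charge_balance_error(sample_mg_per_l):
    """CBE (%) = 100 * (cations - anions) / (cations + anions), in meq/L."""
    cations = total_meq(sample_mg_per_l, CATIONS)
    anions = total_meq(sample_mg_per_l, ANIONS)
    return 100.0 * (cations - anions) / (cations + anions)


def slope_through_origin(x, y):
    """Least-squares slope a of the zero-intercept model y = a * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical sample (mg/L) and hypothetical EC/TDS pairs, not Table 1 data.
sample = {"Ca": 80.0, "Mg": 12.0, "Na": 20.0, "K": 3.0,
          "HCO3": 250.0, "SO4": 60.0, "Cl": 25.0, "NO3": 5.0}
cbe = charge_balance_error(sample)
ec = [200.0, 400.0, 600.0]    # EC in µS/cm
tds = [142.0, 285.0, 427.0]   # TDS in mg/L
slope = slope_through_origin(ec, tds)
print(f"CBE = {cbe:.2f} %, TDS/EC slope = {slope:.4f}")
```

A sample would pass the check used in this study when the absolute CBE is below 5%.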

Concept
CA is a multivariate statistical method that gradually classifies samples based on their similarity. It regards the samples as points in a multidimensional space, and the similarity between points is indicated using statistics [13,27]. Objects with a high degree of similarity are classified into a small cluster, while those with a low degree of similarity are classified into a large cluster. This classification continues until all data objects are classified. In CA, a data set is divided into several clusters, and the objects in the same cluster have a higher degree of similarity than those in other clusters [12,28,29]. CA is seen as a typical combinatorial optimization problem, which is described by the following mathematical model.
In a given set of pattern samples {X}, there are n samples and k classes of patterns {$S_j$, j = 1, 2, ..., k}. Each sample contains m variables. The set X can be expressed by a matrix as:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

where $x_{1i}, x_{2i}, \ldots, x_{mi}$ denote the first, second, ..., m-th variable of the i-th sample. To classify samples, the minimum distance between each sample and its cluster center is taken as the similarity or distance metric, and its objective function is:

$$\min J = \sum_{j=1}^{k} \sum_{i=1}^{n} y_{ij} \lVert x_i - m_j \rVert^2$$

where k is the number of clusters; $x_i$ is the i-th sample (the i-th column of X); $m_j$ denotes the mean vector of the j-th class ($S_j$); and $\sum_{j=1}^{k} y_{ij} = 1$, implying that sample i is assigned to only one cluster center. The classification rule is that if sample i is assigned to the j-th cluster center, then $y_{ij} = 1$; otherwise, $y_{ij} = 0$.

Hierarchical Cluster Analysis
Existing clustering algorithms mainly include hierarchical clustering, partitioning, density-based clustering, grid-based clustering, model-based clustering, and fuzzy clustering. In particular, hierarchical clustering consists of hierarchical decomposition of a given set of data objects. Each object is initially regarded as an individual cluster. Then, objects with the shortest distance are joined into a new cluster until all are joined together in one large cluster.
Depending on the definition of the nearest (neighbor) distance and the recursion equation for clustering, hierarchical clustering can be subdivided into single linkage, complete linkage, median linkage, centroid linkage, average linkage, and Ward's minimum-variance [30]. At present, hierarchical clustering is the most widely used clustering method. The related calculation and analysis modules have been integrated into many statistical analysis software packages or systems, such as SPSS, SAS, and S-PLUS, so that the users can directly invoke relevant functions.
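Outside the packages named above, the same family of methods is available in open-source libraries. The sketch below uses SciPy's `scipy.cluster.hierarchy.linkage` on random placeholder data (19 rows by 8 columns, an assumed shape, not the study's data set). SciPy's `"average"` corresponds to between-groups linkage; SciPy offers no within-groups linkage, and merge heights from SPSS may differ because SPSS can work with squared Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Placeholder data: 19 samples x 8 standardized variables (assumed shape).
rng = np.random.default_rng(0)
X = rng.normal(size=(19, 8))

# Five of the six methods compared in this study, by their SciPy names.
METHODS = ["single", "complete", "median", "centroid", "average", "ward"]

# Each linkage matrix has n - 1 rows: the two merged cluster ids,
# the merge height, and the size of the new cluster.
trees = {m: linkage(X, method=m, metric="euclidean") for m in METHODS}

for name, Z in trees.items():
    print(f"{name:9s} final merge height: {Z[-1, 2]:.3f}")
```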

Single Linkage
In single-linkage clustering, the two closest clusters are joined into a new cluster, and the shortest distance between members (in different clusters) is the distance between the new cluster and another cluster. Two clusters with the shortest distance are joined until one large cluster remains (Figure 1). Let the distance between $x_i$ and $x_j$, i.e., $d(x_i, x_j)$, be represented as $d_{ij}$. Let $G_p$ and $G_q$ denote two clusters containing $n_p$ and $n_q$ objects, respectively. $D(G_p, G_q)$ or $D_{pq}$ represents the distance between clusters $G_p$ and $G_q$. Let $G_r = \{G_p, G_q\}$ represent the new cluster that $G_p$ and $G_q$ join into.
The distance between clusters $G_p$ and $G_q$ is defined as the distance between their closest members, which is referred to as the shortest distance. It is calculated as:

$$D_{pq} = \min_{x_i \in G_p,\ x_j \in G_q} d_{ij}$$

After $G_p$ and $G_q$ are joined into a new cluster $G_r$, the distance between $G_r$ and another cluster $G_k$ ($k \neq p, q$) is calculated based on single-linkage clustering using the formula below:

$$D_{rk} = \min\{D_{pk}, D_{qk}\}$$

Complete Linkage
This method joins the two closest clusters into a new cluster and takes the longest distance between its members as the distance between the new cluster and another cluster. At each step, the two clusters whose farthest-apart members are closest are joined, until all members are in the same cluster (Figure 2). The distance between clusters $G_p$ and $G_q$ is defined as the distance between their farthest-apart members, which is referred to as the longest distance. It is calculated as:

$$D_{pq} = \max_{x_i \in G_p,\ x_j \in G_q} d_{ij}$$

After $G_p$ and $G_q$ are joined into a new cluster $G_r$, the distance between $G_r$ and another cluster $G_k$ ($k \neq p, q$) is calculated using complete-linkage clustering through the following formula:

$$D_{rk} = \max\{D_{pk}, D_{qk}\}$$

Median Linkage
The shortest and longest distances in single and complete linkages represent two extremes in distance measurement. In contrast, median linkage uses an intermediate approach that falls between the shortest and longest distances when calculating the distance between clusters (Figure 3). After $G_p$ and $G_q$ join into a new cluster $G_r$, the distance between $G_r$ and another cluster $G_k$ ($k \neq p, q$) is calculated based on median linkage using the equation below:

$$D_{rk}^2 = \frac{1}{2} D_{pk}^2 + \frac{1}{2} D_{qk}^2 + \beta D_{pq}^2$$

where β is often set to $\beta = -\frac{1}{4}$. With this value, $D_{rk}$ is the length of the median drawn to the side $D_{pq}$ of the triangle formed by $D_{pk}$, $D_{qk}$, and $D_{pq}$.
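The three update rules for single, complete, and median linkage can be compared on a small example. The sketch below takes three singleton clusters at hypothetical plane coordinates and checks that, with β = −1/4, the median-linkage distance equals the distance from the third point to the midpoint of the merged pair, and that it lies between the single- and complete-linkage values.

```python
import math


def update_single(d_pk, d_qk):
    """Distance from the merged cluster to k under single linkage."""
    return min(d_pk, d_qk)


def update_complete(d_pk, d_qk):
    """Distance from the merged cluster to k under complete linkage."""
    return max(d_pk, d_qk)


def update_median(d_pk, d_qk, d_pq, beta=-0.25):
    """Distance from the merged cluster to k under median linkage."""
    return math.sqrt(0.5 * d_pk ** 2 + 0.5 * d_qk ** 2 + beta * d_pq ** 2)


# Three singleton clusters (hypothetical coordinates).
p, q, k = (0.0, 0.0), (4.0, 0.0), (0.0, 3.0)
d_pq, d_pk, d_qk = math.dist(p, q), math.dist(p, k), math.dist(q, k)

d_single = update_single(d_pk, d_qk)
d_complete = update_complete(d_pk, d_qk)
d_median = update_median(d_pk, d_qk, d_pq)

# Geometric check: the median of the triangle from k to side pq.
midpoint = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
d_to_midpoint = math.dist(k, midpoint)

print(d_single, d_median, d_complete, d_to_midpoint)
```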

Centroid Linkage
From a physical perspective, representing a cluster with its centroid is more reasonable. In centroid linkage, the distance between clusters is defined as the distance between their centroids. In object classification, the centroid of a cluster is taken to be the mean of the objects in that cluster (Figure 4). Suppose $G_p$ and $G_q$ are joined into a new cluster $G_r$, and the three clusters contain $n_p$, $n_q$, and $n_r$ ($n_r = n_p + n_q$) objects, respectively. Their centroids are denoted as $\bar{X}^{(p)}$, $\bar{X}^{(q)}$, and $\bar{X}^{(r)}$, respectively. We obtain:

$$\bar{X}^{(r)} = \frac{1}{n_r}\left(n_p \bar{X}^{(p)} + n_q \bar{X}^{(q)}\right)$$

The distance between $G_r$ and another cluster $G_k$ ($k \neq p, q$) is:

$$D_{rk}^2 = \frac{n_p}{n_r} D_{pk}^2 + \frac{n_q}{n_r} D_{qk}^2 - \frac{n_p n_q}{n_r^2} D_{pq}^2$$

Average Linkage
Average linkage considers the average distance between members in two clusters, and can be further subdivided into two types: between-groups linkage and within-groups linkage. When calculating the distance between clusters, between-groups linkage considers the average distance between members in different clusters, while within-groups linkage considers the distance between all members.
The distance between $G_p$ and $G_q$ is defined as the average distance between their member pairs, which is referred to as the average distance between clusters. It is calculated as:

$$D_{pq} = \frac{1}{n_p n_q} \sum_{x_i \in G_p} \sum_{x_j \in G_q} d_{ij}$$

The distance between the new cluster $G_r$ and another cluster $G_k$ ($k \neq p, q$) is calculated as:

$$D_{rk} = \frac{n_p}{n_r} D_{pk} + \frac{n_q}{n_r} D_{qk}$$
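As a numerical check, the recursive updates for centroid linkage and average (between-groups) linkage can be compared with a direct computation on the merged cluster. The cluster coordinates below are hypothetical, and the helper names are illustrative only.

```python
import math
from itertools import product


def centroid(G):
    """Mean point of a cluster given as a list of coordinate tuples."""
    return tuple(sum(c) / len(G) for c in zip(*G))


def centroid_d2(A, B):
    """Squared Euclidean distance between the centroids of A and B."""
    return math.dist(centroid(A), centroid(B)) ** 2


def average_distance(A, B):
    """Mean distance over all pairs with one member in A and one in B."""
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))


# Hypothetical clusters in the plane.
Gp = [(0.0, 0.0), (1.0, 0.0)]
Gq = [(4.0, 1.0)]
Gk = [(2.0, 5.0), (3.0, 6.0), (2.0, 7.0)]
Gr = Gp + Gq
n_p, n_q = len(Gp), len(Gq)
n_r = n_p + n_q

# Centroid linkage: recursion on squared distances vs. direct computation.
centroid_recursive = (n_p / n_r * centroid_d2(Gp, Gk)
                      + n_q / n_r * centroid_d2(Gq, Gk)
                      - n_p * n_q / n_r ** 2 * centroid_d2(Gp, Gq))
centroid_direct = centroid_d2(Gr, Gk)

# Average linkage: size-weighted recursion vs. direct pairwise average.
average_recursive = (n_p / n_r * average_distance(Gp, Gk)
                     + n_q / n_r * average_distance(Gq, Gk))
average_direct = average_distance(Gr, Gk)

print(centroid_recursive, centroid_direct)
print(average_recursive, average_direct)
```

Both pairs of values agree up to floating-point error, which is what allows the dendrogram to be built from the distance matrix alone without revisiting the raw samples.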

Between-groups linkage
This method defines the distance between two clusters as the average distance between their member pairs, where the two members of each pair are from different clusters. At each step, the two clusters with the shortest average distance are merged until all members are joined into a large cluster (Figure 5). In other words, with between-groups linkage, the two clusters that merge into a new cluster are those with the shortest average distance between their member pairs.

Within-groups linkage
This method defines the distance between two clusters as the average distance between any two members of the clusters, irrespective of whether the two members belong to the same cluster. At each step, the two clusters with the shortest average distance are merged until all members are joined into a large cluster (Figure 6). This means that after two clusters merge into a new cluster, the average distance between the members in the new cluster is minimized.
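The difference between the two averaging rules can be seen on a toy example. In this sketch (hypothetical coordinates, illustrative function names), the between-groups value averages only the cross-cluster pairs, whereas the within-groups value averages every pair in the union of the two clusters.

```python
import math
from itertools import combinations, product


def between_groups(A, B):
    """Average distance over pairs with one member in A and one in B."""
    pairs = list(product(A, B))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def within_groups(A, B):
    """Average distance over all pairs of members in the union of A and B."""
    pairs = list(combinations(A + B, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Two tight clusters (hypothetical coordinates).
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (3.0, 1.0)]

d_between = between_groups(A, B)   # 4 cross-cluster pairs
d_within = within_groups(A, B)     # 4 cross pairs + 2 same-cluster pairs

print(f"between-groups: {d_between:.4f}, within-groups: {d_within:.4f}")
```

Because the within-groups value also includes the short same-cluster pairs, it is smaller here than the between-groups value, so the two rules can rank candidate merges differently.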

Ward's Minimum-Variance
This method is based on the analysis of variance (ANOVA). For the correct classification, the ANOVA results show small within-groups sum of squares and large between-groups sum of squares.
Assuming that n samples are categorized into k groups, the i-th sample in the cluster $G_t$ is denoted as $X_i^{(t)}$, and $n_t$ represents the number of samples in $G_t$. Let the centroid of the cluster be $\bar{X}^{(t)}$. Then, the sum of squares within $G_t$ is:

$$S_t = \sum_{i=1}^{n_t} \left(X_i^{(t)} - \bar{X}^{(t)}\right)' \left(X_i^{(t)} - \bar{X}^{(t)}\right)$$

The total sum of squares for k groups is:

$$S = \sum_{t=1}^{k} S_t$$

In Ward's minimum-variance method, n samples are initially considered as separate clusters. Each time two clusters merge, the number of clusters decreases by one, and S increases. At each step, the two clusters whose merger results in the least increase of S are merged, until all samples are joined into the same cluster.
The distance between $G_p$ and $G_q$ is defined as the sum of squares between the two clusters, i.e., the increase in S caused by merging them into $G_r$:

$$D_{pq}^2 = S_r - S_p - S_q$$

The distance between the new cluster $G_r$ and another cluster $G_k$ ($k \neq p, q$) is calculated as:

$$D_{rk}^2 = \frac{n_k + n_p}{n_r + n_k} D_{kp}^2 + \frac{n_k + n_q}{n_r + n_k} D_{kq}^2 - \frac{n_k}{n_r + n_k} D_{pq}^2$$
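The Ward distance and its recursion can be verified numerically. The sketch below, on hypothetical clusters, treats the Ward distance as the increase in the within-cluster sum of squares caused by a merge, compares it with the equivalent closed form based on cluster sizes and centroids, and checks the recursive update against a direct computation.

```python
import math


def centroid(G):
    """Mean point of a cluster given as a list of coordinate tuples."""
    return tuple(sum(c) / len(G) for c in zip(*G))


def within_ss(G):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(G)
    return sum(math.dist(x, c) ** 2 for x in G)


def ward_d2(A, B):
    """Increase in the total sum of squares when A and B are merged."""
    return within_ss(A + B) - within_ss(A) - within_ss(B)


# Hypothetical clusters in the plane.
Gp = [(0.0, 0.0), (1.0, 0.0)]
Gq = [(4.0, 1.0)]
Gk = [(2.0, 5.0), (3.0, 6.0), (2.0, 7.0)]
Gr = Gp + Gq
n_p, n_q, n_k = len(Gp), len(Gq), len(Gk)
n_r = n_p + n_q

# Closed form: size-weighted squared distance between the two centroids.
closed_form = n_p * n_q / n_r * math.dist(centroid(Gp), centroid(Gq)) ** 2

# Recursive update for the merged cluster vs. direct computation.
recursive = ((n_k + n_p) * ward_d2(Gk, Gp)
             + (n_k + n_q) * ward_d2(Gk, Gq)
             - n_k * ward_d2(Gp, Gq)) / (n_r + n_k)
direct = ward_d2(Gr, Gk)

print(ward_d2(Gp, Gq), closed_form)
print(recursive, direct)
```

The size weighting in the closed form is why, for the same centroid separation, merging two large clusters costs more than merging two small ones.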

Data Standardization
Because the observed values of the variables have different orders of magnitude and measurement units, data transformations are necessary to obtain dimensionless data, which avoids inefficient classification and improves the classification accuracy. In this study, the raw data were standardized using Z-scores, so that each transformed variable had a mean of 0 and a standard deviation of 1 (Table 2):

$$z_{ji} = \frac{x_{ji} - \bar{x}_j}{s_j}$$

where $x_{ji}$ is the value of the j-th variable for the i-th sample, and $\bar{x}_j$ and $s_j$ are the mean and standard deviation of the j-th variable over all samples.
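A minimal sketch of Z-score standardization is given below. The raw values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption about how the standardization is carried out.

```python
import statistics


def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # statistics.stdev uses the n - 1 denominator (assumed convention).
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical raw data: rows are samples; columns are variables on
# very different scales (e.g., pH, EC, one ion concentration).
raw = [
    [7.2, 410.0, 55.0],
    [7.8, 650.0, 90.0],
    [6.9, 120.0, 12.0],
    [7.5, 530.0, 71.0],
]
z = zscore_columns(raw)

for row in z:
    print([round(v, 3) for v in row])
```

After the transformation, every variable contributes on a comparable scale to the distance between samples, regardless of its original unit.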

Euclidean Distance
Distance is often used as a quantitative indicator of the degree of similarity between samples. Each sample is regarded as a point in an m-dimensional space. By defining a certain distance between points in the m-dimensional space, we can classify closer points into the same cluster and farther ones into different clusters. This study uses the Euclidean distance (Table 3):

$$d_{ij} = \sqrt{\sum_{l=1}^{m} \left(x_{li} - x_{lj}\right)^2}$$

where $x_{li}$ and $x_{lj}$ are the values of the l-th variable for the i-th and j-th samples, respectively.
All calculations and classification results in this study were obtained using SPSS (IBM, Armonk, NY, USA).
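The pairwise Euclidean distance matrix that serves as input to the clustering can be sketched as follows. The three standardized samples are hypothetical.

```python
import math


def euclidean_matrix(rows):
    """Symmetric matrix of Euclidean distances between all pairs of rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]


# Hypothetical standardized samples (rows) with three variables each.
z_rows = [
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 2.0],
    [1.0, 2.0, 3.0],
]
D = euclidean_matrix(z_rows)

for row in D:
    print([round(v, 3) for v in row])
```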

Single Linkage Method
According to Figure 7, if a line (Line A) is drawn at the Euclidean distance of 2.33, six clusters are formed: water leaked from the Bayi Tunnel; running water from the drain hole; BFW from the Jianxinpo Tunnel; PPW from the Jianxinpo Tunnel; rain; and USW from the CEMC. At the distance of 4.76, three clusters were formed, while only one large cluster existed at the distance of 6.871. If a line (Line B) is drawn at the distance of 2.643, the leaked water from the tunnel and the running water from the tunnel drain hole join into one cluster, indicating a correlation between the two. However, these two types of water samples were distinguished at distances less than 2.643, showing a hydrochemical difference between the water running through the tunnel drainage system and the water that leaks through the lining.
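Drawing a line across a dendrogram at a chosen distance corresponds to cutting the tree into flat clusters. A minimal sketch with SciPy's `fcluster` (using `criterion="distance"`) is shown below; the one-dimensional data are hypothetical and chosen only so that the groups are obvious, and the helper name `clusters_at` is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical 1-D data with three visibly separated groups.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [10.0]])
Z = linkage(X, method="single", metric="euclidean")


def clusters_at(Z, height):
    """Flat cluster labels obtained by cutting the tree at the given height."""
    return fcluster(Z, t=height, criterion="distance")


labels_fine = clusters_at(Z, 1.0)     # a low cut keeps the groups apart
labels_coarse = clusters_at(Z, 6.0)   # a high cut joins everything

print("cut at 1.0:", labels_fine)
print("cut at 6.0:", labels_coarse)
```

Raising the cut height reduces the number of clusters, which mirrors how Lines A, B, and so on in the dendrograms yield progressively coarser groupings.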

Complete Linkage Method
According to Figure 8, if a line (Line B) is drawn at the Euclidean distance of 3.691, six clusters are formed; four clusters are formed at the distance of 5.551 (Line C), and only one large cluster remains at the distance of 8.881. At a distance of 5.551, water leaked from the tunnel and the running water from the tunnel drain hole were joined, indicating a certain correlation between water leaked from different parts of the tunnel. At the distance of 2.9 (Line A), water leaked from the tunnel was clearly divided into three types: (a) the running water from the tunnel drain hole at +272 m; (b) water leaked near the point at +327.5 m; and (c) water leaked near the point at +355 m. The gradual changes in the hydrochemistry of water samples from different sampling locations were reflected in the clustering process and the dendrogram.

Median Linkage Method
Single linkage underestimated the distance between clusters, while complete linkage exaggerated it. Median linkage provided an approach that fell between these two linkages. According to Figure 9, if a line (Line A) is drawn at a Euclidean distance of 2.062, six clusters are formed: water leaked from the Bayi Tunnel; the running water from the drain hole in the tunnel; BFW from the Jianxinpo Tunnel; PPW from the Jianxinpo Tunnel; rain; and USW from the CEMC. At a distance of 3.614 (Line B), three clusters were formed: one cluster included the water leaked from the tunnel, the running water from the tunnel drain hole, and BFW and PPW from the Jianxinpo Tunnel; one cluster only included rain; and another cluster only included USW. This result suggests a difference in composition between rain from the atmosphere and USW from the CEMC. In contrast, there was only one large cluster at a distance of 5.567.

Centroid Linkage Method
From a physical perspective, it is more reasonable to represent a cluster with its centroid. In centroid linkage, the distance between the centroids of two clusters is used to represent the distance between clusters. In object classification, the centroid for a cluster is considered to be the mean of objects in that cluster.
According to Figure 10, if a line (Line A) is drawn at a Euclidean distance of 2.626, five clusters are formed: water leaked and the running water from the drain hole in the Bayi Tunnel; BFW from the Jianxinpo Tunnel; PPW from the Jianxinpo Tunnel; rain; and USW from the CEMC. In median linkage, by contrast, water leaked from the tunnel and the running water from the drain hole were treated as two different types of water at the six-cluster level. This reflects a slight difference between median linkage and centroid linkage, which join these two water types at different distances. At a distance of 4.163 (Line B), three clusters were formed, which is consistent with the classification results of median linkage. Specifically, one cluster included water leaked from the tunnel, the running water from the drain hole in the tunnel, and BFW and PPW from the Jianxinpo Tunnel. One cluster only included rain, while another cluster only included USW from the CEMC. The above results show the similarities between centroid linkage and median linkage. In contrast, there was only one large cluster at a distance of 5.793.

Between-Groups Linkage
According to Figure 11, if a line (Line A) is drawn at an average Euclidean distance of 2.916, the 19 samples will be categorized into six clusters: water leaked from the Bayi Tunnel; the running water from the drain hole in the tunnel; BFW from the Jianxinpo Tunnel; PPW from the Jianxinpo Tunnel; rain; and USW from the CEMC. At a distance of 4.401 (Line C), four clusters were formed. One cluster included the water leaked from the tunnel, the running water from the drain hole in the tunnel, and the BFW from the Jianxinpo Tunnel. One cluster included the PPW from the Jianxinpo Tunnel, while another cluster included rain and USW from the CEMC. In contrast, only one large cluster existed at a distance of 7.553.
Figure 11. Dendrogram of data through between-groups linkage.

Within-Groups Linkage
According to the dendrogram in Figure 12, the 19 groups of samples were classified into three clusters at a distance of 3.316 (Line B). One cluster included the water leaked from the tunnel, PPW from the Jianxinpo Tunnel, and rain. This classification suggests that the water lost through leakage in the Jianxinpo Tunnel and the Bayi Tunnel may be replenished by rainfall. One cluster included the running water from the drain hole in the Bayi Tunnel and the BFW from the Jianxinpo Tunnel. This indicates a connection between the two and a certain hydraulic relation in the rock mass between the two tunnels. Another cluster only included the USW from the CEMC. It showed a poor connection with the other types of water samples, which was also observed in the results of the other methods. This is because USW is human sewage or wastewater with a complex composition, which is completely different from the composition of naturally produced water samples.

Ward's Minimum-Variance Method
According to the dendrogram in Figure 13, if a line (Line B) is drawn at the sum of squares of 27.467, the 19 groups of water samples will be classified into two large clusters: A cluster with only the water leaked from Bayi Tunnel, and the other cluster with other water samples. The 19 groups of water samples could be further classified into six clusters at the sum of squares of 10.837 (Line A): Water leaked near the point at +327.5 m; water leaked near the point at +355 m; the running water from the drain hole; BFW and PPW from the Jianxinpo Tunnel; rain and USW from the CEMC.

Single Linkage Method
In Figure 7, the leaked water from the tunnel only joins BFW from the Jianxinpo Tunnel and rain at distances of 4.76 (Line C) and 5.357 (Line D), respectively. This indicates the absence of a close direct correlation and the significant effects of delayed or lagged rainfall. The water leaked from the tunnel finally joined USW at the late stage of clustering, showing composition differences between water samples. It is inferred that the pipeline was unlikely to be the source of water leak.
The single linkage method is simple and easy to use, which reflects the basic idea of hierarchical clustering in the most intuitive way. The obtained clustering results were consistent with the water samples determined at the initial sample collection stage. This finding suggests that without external influence and interference, single-linkage clustering showed great performance in data classification and characterization, and could be used to produce relatively clear and accurate clustering results.
However, owing to inherent limitations of the method, the closest distance is selected at each step. Over a long stretch of the clustering process, these shortest distances can be very close to one another. This may result in little differentiation between clustering steps (see the joint marked by "I" in Figure 7), which may further interfere with the clustering process and classification mapping.
Moreover, the dendrogram produced by this method has a ladder-like shape and shows an extended-chain structure, implying that chaining is inevitable. Therefore, the internal connections among samples may be obscured to some extent. This is because the distance between clusters is taken as the shortest distance. After two clusters are joined into a new cluster, the distance between the new cluster and any other cluster is shortened, so it is easier to form a large cluster in which most samples are joined. In addition, existing literature shows that single linkage is significantly affected by outliers [31], which limits its application in processing complex data.

Complete Linkage Method
BFW and PPW from the Jianxinpo Tunnel, USW of the emergency center, and rain appeared to have greater distance from the water leaked from the tunnel, suggesting a gradual weakening of the relationship. A relatively strong relationship between the water from the tunnel drainage system and water leaked in the tunnel could be inferred. However, their chemical composition was still slightly different because of different paths and seepage time.
In the complete linkage method, the distance between clusters is defined as the longest distance between their members, which adjusts and improves on single linkage. After two clusters merge, the distance from the merged cluster to any other cluster is taken as the larger of the two original distances. This increases the distance between the merged cluster and other clusters, and avoids the chaining and ladder-like pattern that are inevitable in single linkage. Compared to single linkage, the horizontal axis of the dendrogram was extended and covered a larger range in complete linkage, which produced a more refined clustering result. Objects were further classified into small clusters, which better characterized the data. Despite these advantages, relevant literature shows that, when dealing with data having large dispersions, this method may produce many clusters and results distorted by outliers [32].

Median Linkage Method
The sample order was the same in dendrograms of median linkage and single linkage. Furthermore, results showed the integrity of water leaks in the tunnel and a connection between the running water from the drain hole and BFW. This information was unclear in the previous results, indicating that this method is better in portraying certain details.
Nevertheless, anomalies were detected during clustering. In steps 9, 11, and 16 of the dendrogram (Figure 9), the distance for merging was less than the distance in the previous step. Reversals (labeled as "I, II, and III") were observed, which resulted in crossing lines and closed links. Given the non-monotonicity of median linkage, the clustering results were often unsatisfactory, and it was difficult to track links using the dendrogram [33]. Therefore, this method is rarely used.
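Reversals of this kind can be detected programmatically by checking whether the merge heights are non-decreasing. The sketch below uses SciPy's `is_monotonic` together with a small helper (`reversal_steps`, an illustrative name); the three points form a hypothetical near-equilateral triangle, a configuration known to produce a reversal under centroid and median linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import is_monotonic, linkage

# Hypothetical near-equilateral triangle: the base pair merges first at
# height 2.0, but their midpoint is only 1.8 away from the third point.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.8]])


def reversal_steps(Z):
    """1-based merge steps whose height is lower than the previous step."""
    heights = Z[:, 2]
    return [int(i) + 2 for i in np.flatnonzero(np.diff(heights) < 0)]


Z_single = linkage(X, method="single")
Z_centroid = linkage(X, method="centroid")
Z_median = linkage(X, method="median")

for name, Z in [("single", Z_single), ("centroid", Z_centroid),
                ("median", Z_median)]:
    print(name, "monotonic:", bool(is_monotonic(Z)),
          "reversals at steps:", reversal_steps(Z))
```

Single linkage (like complete and average linkage and Ward's method) always yields non-decreasing merge heights, so a check of this kind is only needed for the median and centroid methods.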

Centroid Linkage Method
In centroid linkage, the sample order in a dendrogram was similar to that of single linkage and median linkage. In addition, its clustering process was similar to that of median linkage, especially with samples of water leakage in small clusters. The centroid linkage differed from median linkage in the middle stage of clustering. The median linkage strengthened the relationship between the running water from the drain hole and PPW from the Jianxinpo Tunnel, which was stronger than the connection with the water leaked from the tunnel. However, the water leaked from tunnel and the running water from the tunnel drain hole were considered to be within the same large cluster, so their correlation with BFW from the Jianxinpo Tunnel was poor.
Three anomalies were observed during centroid linkage clustering, where the distance for merging was less than the distance in the previous step. These anomalies occurred in steps 9, 11, and 16, exactly the same steps as in median linkage clustering. Even the first anomalous merging distance (0.786) was the same. These small statistical values would inevitably cause partial reversals in the dendrogram. The three abnormal distances for merging were 0.786, 1.053, and 4.163, which correspond to the closed links labeled as "I, II, and III" in the dendrogram (Figure 10), respectively.
Centroid linkage requires the Euclidean distance. Each time the two clusters were merged, the cluster centroids had to be recalculated. Therefore, this method is less affected by outliers.
While clusters were well represented by centroid linkage, reversals were likely to occur in dendrograms because the distance between clusters did not follow a monotonically increasing trend [27,34]. It is difficult to track links in the dendrogram, and the symbols may change frequently. In addition, the method may involve complex calculations, which further limits its applications.

Between-Groups Linkage
According to the clustering results with between-groups linkage, the relationship between the running water from the drain hole and BFW from the Jianxinpo Tunnel was strengthened, though such an effect only occurred in step 14 of merging at the average Euclidean distance of 3.844 (Line B). Based on the clustering analysis with the aforementioned methods, it can be inferred that BFW from the Jianxinpo Tunnel had a closer connection with the water leaked and the running water in the Bayi Tunnel than other water samples.
As shown in the dendrogram below, between-groups linkage resolved the issue in single and complete linkages, where the distance between clusters was easily affected by extreme values. It defined the distance between two clusters as the average distance over all sample pairs with one member in each cluster, and thus utilized the distance information of all such pairs [35].
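The contrast between the three cluster-distance definitions can be illustrated with a short sketch. It assumes Euclidean distance and uses two hypothetical clusters that are not part of the study data; the function names are illustrative and do not correspond to any SPSS routine.

```python
import math
from itertools import product

def cross_distances(cluster_a, cluster_b):
    """Euclidean distances of all pairs with one sample in each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    # Distance of the closest cross-cluster pair.
    return min(cross_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Distance of the farthest cross-cluster pair.
    return max(cross_distances(cluster_a, cluster_b))

def between_groups_linkage(cluster_a, cluster_b):
    # Average over all cross-cluster pairs.
    d = cross_distances(cluster_a, cluster_b)
    return sum(d) / len(d)

# Hypothetical clusters; (10, 0) plays the role of an extreme value.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (10.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_average = between_groups_linkage(cluster_a, cluster_b)
print(d_single, d_complete, d_average)
```

Single and complete linkage are each determined by one pair only (the closest and the farthest, respectively), whereas the between-groups value is the mean of all four cross-cluster distances and therefore lies between the two extremes.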

Within-Groups Linkage
Similar to between-groups linkage, the results of clustering with within-groups linkage showed a correlation between BFW from the Jianxinpo Tunnel and the running water from the drain hole in the Bayi Tunnel at an average Euclidean distance of 2.309 (Line A). During the within-groups linkage clustering, the correlation between PPW from the Jianxinpo Tunnel, rain, and the water leaked from the Bayi Tunnel was strengthened, which was not observed in the clustering results with the aforementioned methods.
The within-groups linkage method calculates the average distance over all sample pairs in the merged cluster, including both the pairs between the two clusters and the pairs within each cluster. Compared to between-groups linkage, it also considers the similarity of objects within the same cluster in each clustering step. This method makes use of the known information and considers all samples and individuals. As shown in the dendrogram below, this clustering method achieves good clustering results and has wide applications in practice.
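As a minimal sketch of this definition, the function below averages the Euclidean distances of every pair of samples in the cluster that would result from a merge. The two clusters are hypothetical, and the name `within_groups_linkage` is illustrative rather than an SPSS identifier.

```python
import math
from itertools import combinations

def within_groups_linkage(cluster_a, cluster_b):
    """Average distance over all pairs in the merged cluster."""
    merged = list(cluster_a) + list(cluster_b)
    # combinations() yields cross-cluster pairs and within-cluster pairs alike.
    d = [math.dist(p, q) for p, q in combinations(merged, 2)]
    return sum(d) / len(d)

# Hypothetical clusters, not taken from the study data.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (10.0, 0.0)]

d_within = within_groups_linkage(cluster_a, cluster_b)
print(d_within)
```

Here six pairs enter the average: the four cross-cluster pairs plus one pair inside each cluster. A candidate merge is therefore penalized when either cluster is internally loose, which is how the method takes the similarity of objects within the same cluster into account.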

Ward's Minimum-Variance Method
Compared to the aforementioned methods, the clustering results of Ward's minimum-variance method were the most consistent with the original sample types. This is because the method requires the Euclidean distance between samples, and the distance between two clusters is strongly affected by the number of samples in the two clusters. As a result, large clusters tended to remain far apart and were difficult to merge, a behavior that often met the actual requirements for practical clustering. This method therefore performs well in differentiating objects and shows great resistance to interference. Its classification results were less affected by outliers, and its dendrogram was often clearly structured, straightforward, and accurate, and represented the classification results well.
In dealing with the classification of small samples, Ward's minimum-variance method makes full use of the sample information to explore the internal connection in the data. In the event of little differentiation in samples, this method enlarges the differences between clusters and captures the essential attributes of clusters, thereby providing accurate and reliable classification results [27,36]. In the past, the application of Ward's minimum-variance method was restricted by the complicated calculations. With the growth of computational technology, it is no longer a great challenge to manage such calculations. Therefore, this method is a very effective clustering method in theory and practice.
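The dependence of Ward's merging cost on cluster size can be made concrete with a small sketch. It assumes the standard formulation in which the cost of a merge is the increase in the within-cluster error sum of squares (ESS), which equals n_a n_b / (n_a + n_b) times the squared Euclidean distance between the two centroids. The sample coordinates are hypothetical, and the helper names are illustrative.

```python
def centroid(cluster):
    """Mean position of the samples in a cluster."""
    n = len(cluster)
    return tuple(sum(x) / n for x in zip(*cluster))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ess(cluster):
    """Error sum of squares: squared distances of samples to their centroid."""
    c = centroid(cluster)
    return sum(squared_distance(p, c) for p in cluster)

def ward_increase(cluster_a, cluster_b):
    """Increase in total ESS caused by merging the two clusters."""
    return ess(cluster_a + cluster_b) - ess(cluster_a) - ess(cluster_b)

def ward_formula(cluster_a, cluster_b):
    """Closed form: n_a * n_b / (n_a + n_b) * squared centroid distance."""
    na, nb = len(cluster_a), len(cluster_b)
    d2 = squared_distance(centroid(cluster_a), centroid(cluster_b))
    return na * nb / (na + nb) * d2

# Hypothetical data: both cases have centroids at (1, 0) and (6, 0).
small_a, small_b = [(1.0, 0.0)], [(6.0, 0.0)]
large_a, large_b = [(0.0, 0.0), (2.0, 0.0)], [(5.0, 0.0), (7.0, 0.0)]

cost_small = ward_increase(small_a, small_b)
cost_large = ward_increase(large_a, large_b)
print(cost_small, cost_large)
```

Although the centroids are equally far apart in both cases, merging the two-sample clusters costs twice as much as merging the single samples (25.0 versus 12.5). This size weighting is what keeps well-populated clusters apart until late in the clustering process.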

Hydrochemical Characteristics
Traditional methods of hydrochemical analysis (the Piper trilinear diagram, Schuka Lev classification, and Kurllov's formula) were also applied to interpret the geneses, connections, and classifications of these water samples. As shown in Figure 14, the leakage water samples from the Bayi Tunnel were closely aggregated and approached the rainfall sample over time, which indicates that the tunnel leakage water was strongly mixed with rainfall and that rainfall had an extremely important impact on the leakage water of the tunnel. Under the different classification schemes in Table 4, the leakage water types of the Bayi Tunnel remained basically the same and differed significantly from the rainfall, the CEMC USW, and the Jianxinpo Tunnel BFW and PPW, which is consistent with the results of CA. This indicates that the CA results of multivariate statistical methods and the results of traditional hydrochemical analysis were strongly comparable and could be mutually verified.

Conclusions
(1) In the HCA, single linkage was the most basic, comprehensible, and accessible method, and it reflected the concept of hierarchical clustering directly. However, it was limited by the poor differentiation between clustering steps and the inevitable linking tendency (as seen from the ladder-like shapes in dendrograms). Complete linkage adjusted and improved on the basis of single linkage. It avoided the inevitable generation of links and ladder-shaped dendrograms. By increasing the distance between clusters for merging, clustering with complete linkage was more refined and data sensitive. However, both single and complete linkage were significantly affected by outliers, and were therefore ineffective when processing data with large dispersions.
(2) Unlike single and complete linkage, median linkage avoided measuring extreme distances, whereas centroid linkage emphasized the representativeness of a cluster. The centroids of clusters had to be recalculated each time two clusters merged; therefore, centroid linkage performed more stably when dealing with outliers. However, given the non-monotonicity of these two methods, the merging distance was likely to be less than that of the previous step, which may have led to reversals, partially closed and crossing links, or other issues in dendrograms. Therefore, these two methods were not recommended.
(3) Average linkage was the default method in the HCA module in SPSS. It included two techniques (i.e., between-groups linkage and within-groups linkage), and both could make full use of known information. All samples and indicators were considered, and the clustering process was not easily affected by outliers. Average linkage performed well in clustering and was recommended for dealing with a large number of samples, complex variables, and indicators.
(4) Ward's minimum-variance method could capture and enlarge the differences between clusters that were subtle, hidden, and difficult to identify using other methods, which was conducive to data classification. Using this method, more information could be delivered and expressed, which increased the classification accuracy. For classification tasks with fewer objects and variables, this method could effectively improve the accuracy and classification sensitivity, which could help to explore the essential attributes of data.