Role of Cluster Validity Indices in Delineation of Precipitation Regions

The delineation of precipitation regions is to identify homogeneous zones in which the characteristics of the process are statistically similar. The regionalization process has three main components: (i) delineation of regions using clustering algorithms, (ii) determining the optimal number of regions using cluster validity indices (CVIs), and (iii) validation of regions for homogeneity using L-moments ratio test. The identification of the optimal number of clusters will significantly affect the homogeneity of the regions. The objective of this study is to investigate the performance of the various CVIs in identifying the optimal number of clusters, which maximizes the homogeneity of the precipitation regions. The k-means clustering algorithm is adopted to delineate the regions using location-based attributes for two large areas from Canada, namely, the Prairies and the Great Lakes-St Lawrence lowlands (GL-SL) region. The seasonal precipitation data for 55 years (1951–2005) is derived using high-resolution ANUSPLIN gridded point data for Canada. The results indicate that the optimal number of clusters and the regional homogeneity depends on the CVI adopted. Among 42 cluster indices considered, 15 of them outperform in identifying the homogeneous precipitation regions. The Dunn, D e t _ r a t i o and Trace( W − 1 B ) indices found to be the best for all seasons in both the regions.


Introduction
In hydro-climatology studies, a reliable estimate of precipitation is useful in planning, design, and management of urban water infrastructure, integrated watershed management, and analysis of extremes. The precipitation process is very complex and varies both spatially and temporally. Numerous techniques are developed to model the spatio-temporal variations of precipitation data over large areas. One of the popular techniques is regionalization (or to delineate the regions) based on their analogous/statistical characteristics of precipitation data and its associated attributes. The major factors effecting regionalization are: (i) the spatial correlations between the neighborhood stations; (ii) the non-linearity in the precipitation processes; and (iii) the spatio-temporal resolution of the data. The classical statistical methods [1,2] used to model the complex precipitation processes fail to capture the spatial statistics of the region [3]. Further, these methods require the inherent assumption that the process is Gaussian, which is not true in many practical applications [4]. On the other hand the advent of data mining methods such as clustering, Prinicipal Component Analysis, multisite boostrap, etc., are able to capture the complex characteristics of hydroclimatic variables such as precipitation, without any underlying assumptions of the process [5]. and/or its performance based on the selection of the attributes. However, limited studies reported the effect of CVI on the formation of homogeneous regions [9,10,51,52]. It is indicated that the CVIs do not result in a single ideal number of clusters [27,48,49] and thereby, it is envisaged that the selection of CVI plays a significant role in the delineation of precipitation regions.
The main objective of this study is to evaluate the performance of the cluster validity indices to improve homogeneity of the delineated precipitation regions. In addition, the performance of the CVIs due to seasonal variations are also examined. In this investigation, internal CVIs are employed to obtain the optimal number of regions (partitions), since the fundamental structure of the precipitation process is unknown (which is required for evaluation of external CVIs). The k-means clustering is adopted for the delineation of two large precipitation regions in Canada, namely, the Prairie region and the Great Lakes-St Lawrence (GL-SL) region. Both these areas have distinct climatic conditions due to their dissimilar geophysical characteristics and proximity to large water bodies. The location-based attributes such as elevation, latitude, and longitude are selected for the identification of homogeneous precipitation zones.
The remainder of this paper is structured as follows. The methodology on (i) clustering of regions using k-means algorithm; (ii) cluster validity indices; (ii) L-moments based homogeneity test are presented in Section 2. In Section 3, the salient features of the case study and data are presented. Section 4 provides the detailed results and discussion. Followed by the summary and conclusion of the study in Section 5. The Appendix A provides the details about the CVI and its mathematical form with selection criteria.

Methodology
Efficient clustering algorithms aim at identifying homogeneous precipitation regions and discarding the irregularities present in the data distribution. In this section the three main steps are presented: (i) delineation of regions using k-means clustering, (ii) cluster validity indices (CVI), and (iii) validation of regions for homogeneity using L-moments ratio test, involved in the regionalization of precipitation zones. The framework for the delineation of precipitation regions is presented in Figure 1.

K-Means Clustering Algorithm
Clusters can be identified in a given set of data by determining a local minima solution through an iterative procedure, as McQueen [53] illustrated. This commonly used procedure is known as k-means clustering algorithm. This algorithm positions k centers amongst the data set and then assign the sites to its nearest center [54]. Those attributes having higher value have a more significant effect on the subsequent delineated clusters by k-means algorithm [9]. Therefore, the attributes are re-scaled to reduce the effect of their variances and relative magnitudes using, where, i and j represent attribute and site, respectively; x ji is the re-scaled value of x ji ; µ j and σ j are the mean and the standard deviation, respectively for all sites. Levine [55] proposed a procedure to cluster the data set by minimizing the distance of every site to each of the k cluster centers by re-assigning the attributes among the clusters. The objective function f is given as, where n-dimensional attribute space is denoted by n; N total number of features in the cluster and c j is the value of attribute j at the cluster's centroid as defined in Equation (3); d() represents Euclidean distance measure; and k defines the number of centroids initialized beforehand.
The steps involved in k-means clustering is shown in Figure 2. One of the major challenges in k-means clustering is to identify the initial guess for the number of clusters (k), before the procedure starts. This has a significant effect on the performance of the objective function in attaining the optimal number of clusters. Too many clusters will lead to outlining regions that do not exist, and too few will lead to poor differentiation among distinctly different neighborhoods. In this study, the number of clusters k is varied from 5 to 150 with an interval of 5. The CVIs are used to select the optimal value of k.

Cluster Validity Indices (CVIs)
Number of studies have used CVIs to obtain the optimal number of clusters (k) for a given attribute [20,27,[44][45][46][47][48][49][50]. In this paper, the performances of 27 internal CVIs presented by Desgraupes [50] are compared for delineating the precipitation zones for two regions in Canada. The Dunn index is further classified into an additional 15 indices, resulting in a total of 42 internal CVIs. The goodness of the cluster is evaluated based on the information available from the data [56]. All the indices used in this study are presented in Appendix A, along with their mathematical form and selection criteria. The readers are referred to R programming based cluster package clusterCrit [50] for detailed information on the CVI.

L-Moments Homogeneity Test
Homogeneity tests are carried out to evaluate the statistical coherence of the clusters formed based on the attributes adopted for delineation of precipitation regions [57]. One of the broadly used procedure for testing the homogeneity of the clusters is L-moment homogeneity test [13]. The rationale of the homogeneity test is that population L-moment ratios is the same for the homogeneous cluster sites but different for their subset due to the variability in sampling. The advantages of the test are: (i) higher-order moments are estimated with minimum possible error; (ii) test can be performed with a wide range of distributions; (iii) test is robust to the data set; and (iv) results will be well interpreted since the value of L-moments vary between −1 and 1.
In this study, the weighted standard deviation is used to calculate sample coefficient of at-site L-variation (L-CV), which will be adopted as a heterogeneity measure [8]. Consider the cluster to be validated to have N number of sites and the length of rainfall dataset of site i to be n i . The Equations (4)-(6) provide L-moment ratios for site i, L-CV (l 2 i ), L-skewness (l 3 i ), and L-kurtosis (l 4 i ).
where, p j is the j th observed precipitation measured at i th site. Therefore, the regional average L-CV (l R i ), L-skewness (l R 3 ), and L-kurtosis (l R 4 ) are computed, as shown in Equations (7)-(9). The regional average mean (l R 1 ) is set to 1 by scaling precipitation totals at each site by their mean values.
Homogeneity of a cluster between site dispersion (D) is measured using, The homogeneity of the clusters is compared using Kappa distribution derived from regional average L-moment ratios. In this study, 500 realizations are simulated from Kappa distribution for the study area. For each simulated regions, the value of D is determined. Let µ D and σ D be the mean and standard deviation of 500 realizations, respectively. Then the heterogeneity measure for a cluster is defined as, If H < 1, then the cluster will be considered as "acceptably homogeneous"; else if 1 ≤ H < 2, then the cluster will be regarded as "possibly homogeneous"; else if H ≥ 2, then the delineated precipitation region is stated as "definitely heterogeneous". For detailed information on L-moments the readers are referred to [13].

Case Study
The methodology is employed in two climatic regions of Canada as shown in Figure 3, namely, the Prairies in Western Canada and the Great Lakes-St Lawrence lowlands (GL-SL) in Southern Ontario. These regions play a significant impact on agriculture, water infrastructure and economics of major cities such as Calgary, Edmonton, Regina, Winnipeg, Ottawa, Quebec, Montreal, and Toronto. The Prairie region largely depends on the frequency and the amount of precipitation. The precipitation in the region is mainly governed by large-scale atmospheric circulation and geophysical characteristics [10]. On the other hand, the climate of the GL-SL region is significantly affected due to the variations in the moisture content drawn from the large water bodies surrounding this region and the geophysical characteristics [58].
Prairie region is located at the southern border of Canada extending in three provinces (Alberta, Saskatchewan, and Manitoba) and in north-east extending up to the Rocky Mountains, as shown in Figure 3. The approximate extent of spatial boundaries varies from 49 to 54 degrees latitude and from −117 to −95 degrees longitude. Across this region, the distribution of precipitation varies along the provinces with maximum precipitation in Manitoba and minimum in the Saskatchewan [59]. The region is also characterized by seasonal precipitation variation [11]. The GL-SL region stretches from Southern Ontario to Quebec provinces and lies approximately between latitudes and longitudes of 42 to 48 degrees and −83 to −70 degrees respectively, as shown in Figure 3. The Great Lakes region covers Southern Ontario to the south of the Canadian Shield. The region experiences variable weather patterns due to the cold and dry air from the north, predominant winds and humid air blowing from the west and the Gulf of Mexico, respectively [60]. In the period from November to February, a cold air mass accumulates moisture by passing over the warm water bodies in the region and results in downwind precipitation [61]. In general, the precipitation in summer season in the Great Lakes is characterized by cloud bursts. In addition, the total annual precipitation varies across the region, with highest in the eastern part, i.e., St. Lawrence Valley sub-region experiences more annual precipitation than Great Lakes sub-region [11].

Data
Regionalization of precipitation regions requires good quality of spatial and temporal data. In the case of large-scale studies, the distribution of meteorological stations is not adequate for accurate estimation of regional parameters. In recent years, several studies have adopted gridded reanalysis or satellite data in the absence of historical data [11,16,[62][63][64]. The major limitation of satellite data is the record length when compared to the reanalysis data. In this study, ANUSPLIN [65][66][67], a high-resolution gridded precipitation data is used for delineation of precipitation zones. This dataset has been used in several studies [11,[68][69][70]. The grid map in Figure 3 shows the size of ANUSPLIN grid locations for both the regions. The data is approximately 300 arc second, and it contains 10810 and 3840 grid point locations within Prairie and GL-SL regions, respectively. It is to be noted that in Figure 3, due to the scale of the map, the grid location-based attributes appear to be a solid color. The data is ranging between years 1951-2005, for a record length of 55 years. The meta-data from the ANUSLIN is used to extract the location-based attributes for k-means clustering algorithm.

Results and Discussion
The daily precipitation series from ANUSPLIN point gridded data is aggregated to four seasons, namely winter (DJF), spring (MAM), summer (JJA), and autumn (SON). In this study, the location-based attributes such as elevation, latitude, and longitude are selected for delineation of precipitation regions. Asong et al. [10] and Irwin et al. [11] have also shown that the location-based attributes have a significant effect on the homogeneity of the regions. The Digital Elevation Model over the study regions is obtained from the Canadian Digital Elevation Data (CDED), which can be accessed from the Canadian GeoBase website (http://www.geobase.ca/). The elevation data from DEM is extracted at the latitude and longitude of the ANUSPLIN grid locations using Natural Neighborhood based interpolation method in ArcGIS 10.2.

Performance of the CVIs
The k-means clustering is adopted for the delineation of two large precipitation regions in Canada. The major limitations of this algorithm are the accuracy of the initial random centroids location and the identification of the optimal number of clusters, since it lacks theoretical background when compared to supervised algorithms [49]. Several studies have used an iterative process with initial random centroids to select the best clusters [11,15,71]. In this study, the k-means algorithm is executed for 50 iterations with varying initial random centroids, and the best cluster is selected for further analysis. Followed by, the identification of the optimal number of clusters is carried out using the CVIs. Finally, the optimal number of clusters is validated using the L-moments based homogeneity test. The regional percentage homogeneity represents the ratio of the total number of homogeneous clusters to the optimal number of clusters identified by CVI.
An illustrative example of the performance of two CVIs (Det-ratio and Log-SS-Ratio) is presented in Figure 4. The comparison is based on the percentage homogeneity, number of clusters and the CVI's selection criteria. The Figure 4 describe: (i) CVI values (on y-axis); (ii) percentage homogeneity (on x-axis); (iii) the numerical in the panel indicate number of clusters resulting in the pair of CVI value and percentage homogeneity; (iv) horizontal y-intercept red line represents CVI's ideal value based on the selection criteria presented in Table A1; (v) vertical x-intercept blue line represents the value of percentage homogeneity with respect to the CVI's ideal value in (iv). In Figure 4, the CVI value, number of clusters, and percentage homogeneity corresponding to the CVI's ideal value (red lines) are (82003, 85, 87.5%) and (3.900, 70, 80%) for Det-Ratio and Log-SS-Ratio, respectively. The selection of CVI is based on the higher value of percentage homogeneity among the CVIs. In case, if the CVIs have similar/equal percentage homogeneity, then the CVI with the lower number of clusters is selected. In the example illustrated, it is evident that the Det-Ratio is able to obtain a higher value of percentage homogeneity (87.5%) when compared to Log-SS-Ratio which is (72%) and hence the Det-Ratio based CVI is selected. Similarly, the relative performance of several CVIs can be assessed based on the above selection procedure.

Application to Canadian Regions
The performance of the 42 CVIs for four seasons in the two Canadian Regions are presented in Figures 5-12. The number of clusters is varied from 5 to 150 at an interval of 5, and the ideal number of clusters is selected based on the CVI selection criteria as presented in Table A1. For brevity, the results for the number of clusters is limited to 100, since no significant changes were observed beyond this cluster number. In addition, the results are also compared with the performance of the empirical formula. The number of clusters formed using empirical formula is 27 and 22 for the Prairie and GL-SL region, respectively. It is to be noted that Irwin et al. [11] found that the optimal number of clusters is significantly different from empirical formula for each season and both the regions of Canada. The results for each region is presented as follows:

Spring Season:
|T| |W| ratio index found to be the best in delineation of the region in 80 clusters out of which 69 clusters are homogeneous.

3.
Summer season: All the indices except, In addition, with a notable comparison between seasons, seven CVIs delineate the region with highest percentage homogeneity for the Autumn season, ten indices regionalize with best percentage homogeneity in the Winter season, and three indices provide similar results for both winter and autumn seasons as shown in Table 2. In addition, it is observed that among the selected best indices, the Dunn and Trace_WiB index provides a smaller number of clusters when compared to other indices in Winter and Autumn seasons.

Great Lakes-St. Lawrence Lowlands (GL-SL) Region
Great Lakes-St. Lawrence lowlands region is comprised of 3840 sites. The region was divided into 22 clusters based on the empirical formula, for which the percentage homogeneity are observed to be: (i) Winter season-86%, (ii) Spring season-78%, (iii) Summer season-76%, and (iv) Autumn season-81%. Figures 9-12 presents the results for the respective CVI, elucidating the value of indices with respect to the percentage homogeneity. Table 3 illustrates the 15 cluster validity indices which have performed well in the region, along with their determined number of clusters and corresponding percentage homogeneity. It is observed from Table 3 that the percentage homogeneity is higher in Great Lakes region when compared to Prairie region. This could be due to their differences in geophysical characteristics, wherein the Prairie region is significantly affected by the Rocky Mountain ranges and is difficult to model [11]. In addition, Table 3 shows that the indices have determined an absolute 100% percentage homogeneity for the autumn season. In this region, the following are observed: Further, it is observed that for all the four seasons the best CVI show notably a narrow range of percentage homogeneity for each region. The performance of the clusters is lower for the spring season when compared to other seasons for both the regions. This could be due to weaker and/or stronger influences of ENSO cycle usually makes it difficult in the prediction of precipitation in spring season [72]. In addition, from Tables 1-3 it is observed that for both the regions the following CVI are selected for regionalization studies based on the minimum number of clusters to achieve similar percentage of homogeneity in respective seasons:

Conclusions
In this study, the performance of the various cluster validity indices are investigated in identifying the optimal number of clusters, which maximizes the homogeneity of the precipitation regions for Prairie and GL-SL regions in Canada. It is evident from the results that the optimal number of clusters and the regional homogeneity depends on the CVI adopted, location of the study area, and seasonal variations. Out of 42 CVIs, about 14-15 indices perform better in preserving the homogeneity of the clusters, and there is no single CVI among the best-selected which outperform the others. The Dunn index, Det_Ratio index, and Trace(W −1 B) index, are found to be the best for all seasons in both the regions. The study provides the possibility of improvement in the prediction of hydro-climatic variables and their applications such as meteorological droughts, design of hydraulic regulatory structures, downscaling of hydroclimatic variables, watershed management, and prediction in ungauged basins. Further, the variations in homogeneity due to CVIs will be helpful in the quantification of uncertainty and sensitivity of attributes in delineation of precipitation zones.

•
Although the k-means algorithm converges well there is a tendency of solutions not reaching global optima. It requires a number of iterations to get the best solution using random sets of initial centroids. The computational burden is high with an increase in number of grid points (or stations) as well as length of records. The performance of the CVI can be evaluated using other clustering algorithms which may elevate above issues. Further, investigation can be carried out to represent/understand the regional processes by studying the similarity, separation, and cohesion of the clusters in the region.

•
The process is assumed to be stationary, which may not be true, especially under the effect of climate change. The non-stationary algorithms can be used for delineation of precipitation regions.

•
The season-wise performances of CVIs vary significantly in both regions. This could be due to the effect of combined influences of large-scale variables and geophysical characteristics [72]. These attributes are very important to study the effect of climate change on hydrological variables. Moreover, it is envisaged that the selection of appropriate attributes according to the seasons may result in better prediction of rainfall characteristics. In this study attributes are limited to geophysical characteristics. The research is in progress to understand the role of CVI with additional climate-based attributes and their seasonal variations.

•
The other limitation is the non-availability of sub-daily data for ANUSPLIN [66,67]. For example, in case of design and management of water infrastructure, the development of intensity-duration-frequency precipitation curves require sub-daily data which is not available in ANUSPLIN. Alternatively, the availability of sub-daily reanalysis data such as NCEP-NARR [73] (updated data release 2016) can be used. Further, a disaggregation model can be adopted to generate sub-daily data from ANUSPLIN.

•
The study is in progress to (i) identify the role of climate indices for various combinations of attributes, clustering algorithms, and cluster validity indices; (ii) analyze number of study areas across the globe to generalize the selection of CVI, or to identify the best set of CVIs for their respective climate zones. and constructive comments. The second author is thankful to the members of HydroSystems Research Laboratory, IIT Tirupati for their encouragement and support.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The details of the CVI used in this study are presented in Table A1. The table provides the mathematical equation and the selection criteria for each of the CVI. Table A1. Cluster validity indices from clusterCrit R package developed by Desgraupes [50]. The clusterCrit name for each cluster index is presented in parenthesis. If the selection criteria is, minimum difference or maximum difference then, the clusters corresponding to the index value at which there is minimum slope or maximum slope difference versus number of clusters curve should be chosen.

Cluster Validity Mathematical Equation Selection Index Criteria
Ball-Hall where, C is the Cluster Validity Index; M [k] is the matrix of Difference attributes of k th cluster; G [k] is the barycentre # of attributes in k th cluster; n k is number of sites in k th cluster; K is the total number of clusters Banfeld-Raftery Minimum where, N W is the number of pairs of distinct sites in a cluster; S W is the sum of N W distances between all the pairs of sites inside each cluster; S min is the sum of N W smallest distances between all the pairs of sites in the entire dataset; S max is the sum of N W largest distances between all the pairs of sites in the entire dataset where, N B is the number of pairs of sites which do not belong to the same cluster; M i is the row matrix of i th site's attributes; d() is the Euclidean distance between sites PBM (PBM) Table A1. Cont.

Cluster Validity Mathematical Equation Selection Index Criteria
Point Biserial where, N T is the total number of pairs of distinct sites in the dataset Calinski-Harabasz Minimum where, X is the matrix formed by centred vectors V j is the j th observed attribute; µ j is the barycentre # of j th observed attribute Table A1. Cont.

Cluster Validity Mathematical Equation Selection Index Criteria
Gamma (Gamma) C = Γ = s + −s − s + +s − Maximum where, s + is the number of pairs of sites which are not in same cluster and whose distance is less than those which are in same cluster; s − is the number of pairs which are in same cluster and whose distance is more than those which are in same cluster where, H kk is the mid-point of G k and G k ; γ kk () is the total number of sites in these k th and k th clusters whose distance to the given point is less than