A Silhouette-Width-Induced Hierarchical Clustering for Deﬁning Flood Estimation Regions

: Flood quantile estimation in ungauged basins is often performed using regional analysis. A regionalization procedure consists of two phases: the deﬁnition of homogeneous regions among gauged basins, i.e., clusters of stations, and information transfer to the ungauged sites. Due to its simplicity and widespread use, a combination of hierarchical clustering by Ward’s algorithm and the index-ﬂood method is applied in this research. While hierarchical clustering is very efﬁcient, its shortcomings are the lack of ﬂexibility in the deﬁnition of clusters/regions and the inability to transfer objects/stations from one cluster center to another. To overcome this, using silhouette width for induced clustering of stations in ﬂood studies is proposed in this paper. A regionalization procedure is conducted on 53 gauging stations under a continental climate in the West Balkans. In the induced clustering, a negative silhouette width is used as an indicator for the relocation of station(s) to another cluster. The estimates of mean annual ﬂood and 100-year ﬂood quantiles assessed by the original and induced clustering are compared. A jackknife procedure is applied for mean annual ﬂood estimation and 100-year ﬂood quantiles. Both the Hosking–Wallis and Anderson–Darling bootstrap tests provide better results regarding the homogeneity of the deﬁned regions for the induced clustering compared to the original one. The goodness-of-ﬁt measures indicate improved clustering results by the proposed intervention, reﬂecting ﬂood quantile estimation at the stations with signiﬁcant overestimation by the original clustering.


Introduction
Flood frequency analysis (FFA) is commonly used in hydrology to relate extreme river flow to its frequency of occurrence, i.e., to estimate flood quantile [1]. The application of statistical methods in this analysis implies long-term, reliable gauge flow data [2]. For instance, a known issue with gauged high flows is the uncertainty of the rating curve that provides the relationship between the water flow and the corresponding water stage in that particular domain of interest [3,4]. The application of statistical methods implies homogeneous and stationary datasets, which contradicts the increasing evidence for climate change [5,6], although some flood-related studies suggest the opposite [7][8][9][10]. Assuming that the datasets are homogeneous and reliable, the next question is whether the population distribution function and its parameters can be reliably estimated from the available data record length.
Some countries have suggested procedures, distribution functions, and parameter estimation methods for FFA. In the USA, the Log Pearson 3 (LP3) function is used, with the

Study Region and Data
The study region is the southern part of the Danube River basin, characterized by a continental climate. The annual peak flow data were collected for 64 hydrological stations (HSs), with a catchment area of up to 3000 km 2 in the territory of Bosnia and Herzegovina (19 HSs) and Serbia (45 HSs). The data were quality assured by the respective hydrometeorological services. A total of 11 HSs were excluded from this study due to different reasons, including short peak flow datasets, a flow regime regulated by upstream reservoirs, and difficulty in defining a catchment border due to karst. The locations and catchment boundaries of the remaining 53 HSs are shown in Figure 1.
The peak flow datasets of 53 HSs in the gauge period 1948-2016 were subjected to statistical tests to determine suitability for statistical analysis, i.e., flood frequency analysis (tests of stationarity, independence, and homogeneity in mean and variance). In 12 HSs, the datasets were trimmed (cut off) from the beginning of a gauge period until they were found to be suitable for statistical analysis at the 5% significance level of the applied tests. The trimmed gauged peak flow data range is from 1 to 22 years. The completeness and gauge period of the datasets are shown graphically in Appendix A ( Figure A1). Long data gaps are characteristic for the HSs in Bosnia and Herzegovina, starting from HS no. 43 in Figure A1. The incomplete datasets were not augmented. The average size of the datasets is 51 years. Table A1 in Appendix A contains the analyzed datasets' flood statistics for the 53 HSs. The peak flow datasets of 53 HSs in the gauge period 1948-2016 were subjected to statistical tests to determine suitability for statistical analysis, i.e., flood frequency analysis (tests of stationarity, independence, and homogeneity in mean and variance). In 12 HSs, the datasets were trimmed (cut off) from the beginning of a gauge period until they were found to be suitable for statistical analysis at the 5% significance level of the applied tests. The trimmed gauged peak flow data range is from 1 to 22 years. The completeness and gauge period of the datasets are shown graphically in Appendix A ( Figure A1). Long data gaps are characteristic for the HSs in Bosnia and Herzegovina, starting from HS no. 43 in Figure A1. The incomplete datasets were not augmented. The average size of the datasets is 51 years. Table A1 in Appendix A contains the analyzed datasets' flood statistics for the 53 HSs.
The catchment morphological attributes, catchment area (A), average slope (Iavg) and average elevation (Havg), are taken from previous research, determined from the 20 m resolution DEM [42].

Methodology
The research process consists of four phases, as shown in Figure 2 as a computational workflow divided into 6 steps. The initial phase is the formation of flood estimation regions ( Figure 2, column 1), the second phase is flood quantile estimation in pseudo-ungauged catchments (Figure 2, columns 2-4), the third phase is estimation of 'true' flood quantiles by FFA, and the final phase is the comparison of regionally assessed flood quantiles and true ones. The catchment morphological attributes, catchment area (A), average slope (Iavg) and average elevation (Havg), are taken from previous research, determined from the 20 m resolution DEM [42].

Methodology
The research process consists of four phases, as shown in Figure 2 as a computational workflow divided into 6 steps. The initial phase is the formation of flood estimation regions ( Figure 2, column 1), the second phase is flood quantile estimation in pseudo-ungauged catchments ( Figure 2, columns 2-4), the third phase is estimation of 'true' flood quantiles by FFA, and the final phase is the comparison of regionally assessed flood quantiles and true ones. One subjective method and two objective methods are used to form the regions. The subjective method implies that all stations form one region (label 1REG), while in the objective methods, regions are formed by the cluster analysis (label CLUSTER). The results of cluster analysis are the regions CLUSTER_org (i.e., original) and CLUSTER_adj (i.e., adjusted). The difference between the two regions (CCs) is in the adjustments made to CLUSTER_adj from CLUSTER_org to achieve positive silhouette widths of all objects (HSs) in CCs (Figure 2, column 1).
The regional analysis starts with the formation of regions where the 53 HSs are dis- One subjective method and two objective methods are used to form the regions. The subjective method implies that all stations form one region (label 1REG), while in the objective methods, regions are formed by the cluster analysis (label CLUSTER). The results of cluster analysis are the regions CLUSTER_org (i.e., original) and CLUSTER_adj (i.e., adjusted). The difference between the two regions (CCs) is in the adjustments made to CLUSTER_adj from CLUSTER_org to achieve positive silhouette widths of all objects (HSs) in CCs (Figure 2, column 1).
The regional analysis starts with the formation of regions where the 53 HSs are distributed among clusters according to similarity. Then, one HS is left out (pseudoungauged catchment), and the homogeneity of its region (CC) is tested using the Hosking-Wallis (HW) test, Anderson-Darling (AD) bootstrap test, and identification of discordant stations. In such a region, both mean annual flood (MAF) and a regional growth curve are estimated ( Figure 2, columns 2-4). The result is appraised with respect to flood quantile at studied HSs estimated by FFA (Figure 2, column 5). The procedure is repeated for each HS and the three regionalization methods. The regionalization methods are generally evaluated through percent bias (PB) and regional root mean square error (RRMSE) (Figure 2, column 6).

Formation of the Regions
The subjective approach to the formation of regions leads to one region comprising all HSs (1REG). A selected objective approach is the application of one of the clustering methods, leading to the formation of two sets of regions (CLUSTER).
In the cluster analysis, catchments are allocated to regions using a hierarchical agglomerative approach that classifies objects (catchments) according to morphological similarities. The morphological attributes should not exhibit significant correlation to avoid giving too much importance to individual attributes [43]. At the beginning of hierarchical clustering, each HS forms an object (cluster) for itself, and through a clustering algorithm, merging is performed at each level (height) until the clusters become one region. Ward's algorithm is applied here, known for its advantage in the formation of compact regions [18,44].
The silhouette width is used for the selection of the number of centers. It is calculated as the ratio of the average distance of the object i from the others falling into the same center j and the maximum distance d from the objects of other centers: where s j i is the silhouette width of the object i from the cluster (center) j, a j i is the average distance between i-th vector in cluster C j and other vectors in the same cluster, and b j i is the minimum average distance between i-th vector in cluster C j and other vectors in the remaining clusters C k (k = 1, . . . , K, k = j): where d X j i , X j k is the distance between objects X i and X k , m j is the number of objects in cluster C j , m n is the number of objects in remaining clusters C k , and K is the number of cluster centers.
The silhouette width range is [−1, 1]. The closer the silhouette width is to 1, the more compact the cluster, and the object is assigned to an adequate center.
The selection of a number of centers is performed by maximizing the mean silhouette width (MSW). The label for this clustering method is _org (original), therefore regions are labelled CLUSTER_org. An issue in the _org clustering method is the presence of silhouette widths close to zero and negative. In the former case, it is inconclusive whether the object is in the adequate center. Negative silhouette width is used as an indicator of the wrong center, and the regions are adjusted accordingly to achieve positive silhouette widths (CLUSTER_adj).
The adjustment starts from the original regions CLUSTER_org and it is checked whether the smallest value of the silhouette width is negative. If so, the object is moved to the neighboring CC, and the silhouette widths are recalculated. The procedure stops when all silhouette widths are positive, as shown in the flow diagram in Figure 3. . The silhouette-width-induced clustering-an adjustment procedure of the region size to achieve positive silhouette widths.

Region Homogeneity Testing
The region homogeneity is tested for each regionalization method and HS. The pseudo-ungauged HS is left out from its region, and the homogeneity of the remaining mj − 1 stations in the region is examined. Two tests were used: the parametric Hosking-Wallis (HW) test and the Anderson-Darling (AD) bootstrap test. Additionally, the number of discordant sites (HSs) is sought.
The HW test is based on the comparison of deviations (dispersion) of dimensionless moments with deviations in homogeneous regions. If the deviations are small enough, they are attributed to sampling error and the region is considered homogeneous.
The measures of dispersion are [17] Figure 3. The silhouette-width-induced clustering-an adjustment procedure of the region size to achieve positive silhouette widths.

Region Homogeneity Testing
The region homogeneity is tested for each regionalization method and HS. The pseudoungauged HS is left out from its region, and the homogeneity of the remaining m j − 1 stations in the region is examined. Two tests were used: the parametric Hosking-Wallis (HW) test and the Anderson-Darling (AD) bootstrap test. Additionally, the number of discordant sites (HSs) is sought.
The HW test is based on the comparison of deviations (dispersion) of dimensionless moments with deviations in homogeneous regions. If the deviations are small enough, they are attributed to sampling error and the region is considered homogeneous.
The measures of dispersion are [17] where t, t 3 , and t 4 are dimensionless L-moments at HS (i) and region (R). Regional dimensionless moments are calculated as average by weighting local values by dataset size n i : Following the procedure of Hoskings and Wallis [17], dispersion of each measure expected in homogeneous regions is obtained via 500 Monte Carlo simulations (N sim = 500). The four-parameter kappa distribution is specified. Based on dispersion statistics (mean µ V and standard deviation σ V ) H1, H2, and H3 measures are calculated as [17] A region is considered 'acceptably homogeneous' if H test statistics are less than 1 and 'definitely heterogeneous' for values greater than 2. The range between 1 and 2 is the 'possibly homogeneous' region. Even in extremely heterogeneous regions, H2 and H3 rarely reach values higher than 2 [19], so only H1 is used to check the homogeneity of the region.
In addition to the HW test, the AD bootstrap test was performed, recommended in cases of significant regional LCs values (greater than 0.23) [45]. The test is based on the comparison of local and regional empirical distribution functions as whereF i (x) is an empirical function of the i-th sample (HS) of length n i , and H N (x) is a regional empirical function defined from sample size N = n 1 + · · · + n k .
To avoid false conclusions about homogeneity, the nonparametric bootstrap procedure [45] is applied.
The third region homogeneity insight is obtained from the detection of discordant HSs. The discordant HS is the one far away from the centroid of the cloud of points in the dimensionless L-moments space.
The distance of each point is defined as [17]  where u i is the vector of dimensionless moments at HS i u i = t (i) , t (i)

, u is the vector of unweighted individual vectors, and matrix A is defined as
The critical test values depend on the region size and range from 2.491 for 5 HSs to 3 for regions with more than 15 HSs [17].

Flood Information Transfer
The index-flood method is intended for regional flood frequency analysis [46,47]. This method combines data from several HSs to obtain one theoretical distribution function, i.e., regional growth curve, supposing an index (scaling factor) can be found at ungauged HS elsewhere or by some other method [17].
The mean annual flood (MAF) is selected as an index-flood statistic (index) here. For the pseudo-ungauged HS i, MAF is calculated by regression analysis, with one independent variable, a morphologic attribute: Catchment area A, the average catchment elevation Havg, and the average catchment slope Iavg are considered as attributes.
Estimation of regression coefficients a and b in Equation (14) is performed using the least square method for 1REG and CLUSTER regions.
The regional growth curve is adopted based on the proximity of regional values on the L-moment plot (Section 3.3.2). The GEV distribution is selected for regional distribution with a regional growth curve.

At-Site Flood Quantile Estimation
The selection of GEV distribution for at-site flood quantile estimation is based on the research results for the same study region (HSs) [34], where five distribution functions in FFA are tested: Log-normal (LN), Gumbel (GUMB), exponential (EXP), Pearson 3 (P3), Log Pearson 3 (LP3), Generalized Extreme Values (GEV), and General Logistic (GLO). The Probability Plot Correlation Coefficient (PPCC) test showed that the GEV, LP3, and GLO functions are overall best ranked among the tested distribution functions [34].
GEV has been confirmed as the best distribution function in a large number of European basins [48,49] and beyond [50,51]. As a positive feature, the consistent estimation of the confidence interval is found [52], as well as minor sensitivity of the right distribution tail to low outliers, which is of considerable practical importance in flood analysis [53].

Efficiency of the Regionalization Methods
The efficiency of the applied regionalization methods for flood quantile assessment is analyzed for variables estimated in the regionalization procedure: MAF, 100-year flood quantiles, and the ratio of MAF to 100-year flood quantiles ( Table 1). The flood quantile Q100 is selected for the efficiency analysis because an average peak flow dataset size in HSs of N = 51 years corresponds to the recommendation for the reliability of quantile estimation related to a return period of T < 2N-3N [13,54]. Two efficiency measures are calculated: relative bias and regional root mean square error (Equations (15) and (16)). Variables estimated in the at-site FFA are considered 'observed' in the efficiency assessment, as shown in Table 1.
Relative bias of regionalized (modelled) values: Regional Root Mean Square Error (RMSE):

Formation of Regions
Regions were formed in two ways: subjectively (a region comprising all HSs, label 1REG) and by hierarchal cluster analysis applying Ward's algorithm (label CLUSTER), where the selection of similarity attributes was performed according to correlation significance (p-value) calculated for Pearson R 2 and Spearman ρ (rho) coefficients ( Table 2). According to the results, basin area (A) and mean catchment elevation (Havg) are selected. The merging flow of HSs in the objective regionalization method is shown by a dendrogram, providing the opportunity to designate regions for any number of CCs ( Figure 4). The selection of a number of centers was performed by maximizing the mean silhouette width (MSW) ( Figure 5). However, this cannot be a sole criterion because it is recommended that the number of objects in clusters is between 5 and 20 in hydrological research [24]. A compromise is found with 6 CCs, with 15, 5, 5, 5, 7, and 16 HSs comprising region CLUSTER_org ( Figure 4). The locations of catchments according to CCs are shown in Figures A2 and A3.  In the CLUSTER_org, HS20 and HS45 have a negative silhouette width ( Figure 6). It took four iterations from CLUSTER_org to obtain CLUSTER_adj by the positive silhouette-width-induced clustering. The silhouette widths calculated in the original and adjusted procedures for each HS are shown in Figure 6. Three HSs moved from CC2 to CC1 (HS 2, 45, and 49), and HS 20 moved from CC1 to CC5. The difference in silhouette widths at some stations is notable, e.g., HS 1, HS 15, and HS 41, for better but HS 27, HS 50, and HS 51 for worse.   In the CLUSTER_org, HS20 and HS45 have a negative silhouette width ( Figure 6). It took four iterations from CLUSTER_org to obtain CLUSTER_adj by the positive silhouette-width-induced clustering. The silhouette widths calculated in the original and adjusted procedures for each HS are shown in Figure 6. Three HSs moved from CC2 to CC1 (HS 2, 45, and 49), and HS 20 moved from CC1 to CC5. The difference in silhouette widths at some stations is notable, e.g., HS 1, HS 15, and HS 41, for better but HS 27, HS 50, and HS 51 for worse. In the CLUSTER_org, HS20 and HS45 have a negative silhouette width ( Figure 6). It took four iterations from CLUSTER_org to obtain CLUSTER_adj by the positive silhouettewidth-induced clustering. The silhouette widths calculated in the original and adjusted procedures for each HS are shown in Figure 6. Three HSs moved from CC2 to CC1 (HS 2, 45, and 49), and HS 20 moved from CC1 to CC5. The difference in silhouette widths at some stations is notable, e.g., HS 1, HS 15, and HS 41, for better but HS 27, HS 50, and HS 51 for worse.

Region Homogeneity
Regional homogeneity was tested according to the H1 measure (HW test) and p1 and p2 values (AD bootstrap test adopting mean and median as index values, respectively). The conclusion was derived based on 500 simulations conducted in R using the {nsRFA} library. The consistency of HW and AD bootstrap tests is shown by the number of regions that matched the test conclusion (both accepted or both rejected homogeneity null hypothesis at the 5% significance level) and the number of regions where this was not the case (inconsistent results) ( Table 3). Homogeneity was tested for each pseudo-ungauged HS (its region). A comparison of homogeneity testing results in Table 3 shows that _adj clustering provided more consistent results, e.g., the number of HSs with inconsistent conclusion in _adj is 10/2 (p1 and p2), i.e., 10/4 (p1 and H1) compared to 22/14 and 17/11 in _org clustering.  Figure 7 shows an individual homogeneity class at each HS based on the results consolidated in Table 4 (p1 and H1) as well as the presence of a discordant station when that HS is left out (pseudo-ungauged). From Figure 7, it can be seen that twelve HSs should further produce reliable results in both _org and _adj cluster centers because their estimations will come from definitely homogeneous regions, free from discordant stations. These

Region Homogeneity
Regional homogeneity was tested according to the H1 measure (HW test) and p1 and p2 values (AD bootstrap test adopting mean and median as index values, respectively). The conclusion was derived based on 500 simulations conducted in R using the {nsRFA} library. The consistency of HW and AD bootstrap tests is shown by the number of regions that matched the test conclusion (both accepted or both rejected homogeneity null hypothesis at the 5% significance level) and the number of regions where this was not the case (inconsistent results) ( Table 3). Homogeneity was tested for each pseudo-ungauged HS (its region). A comparison of homogeneity testing results in Table 3 shows that _adj clustering provided more consistent results, e.g., the number of HSs with inconsistent conclusion in _adj is 10/2 (p1 and p2), i.e., 10/4 (p1 and H1) compared to 22/14 and 17/11 in _org clustering.  Figure 7 shows an individual homogeneity class at each HS based on the results consolidated in Table 4 (p1 and H1) as well as the presence of a discordant station when that HS is left out (pseudo-ungauged). From Figure 7, it can be seen that twelve HSs should further produce reliable results in both _org and _adj cluster centers because their estimations will come from definitely homogeneous regions, free from discordant stations. These HSs are HS 43 (CC1), HS 24

Flood Information Transfer
The selection of an appropriate attribute (independent variable) in the regression equation (14) for MAF prediction is based on the relative bias of MAF estimation via attributes A, Iavg, and Havg for region 1REG. The best agreement of MAF (the smallest relative bias range) is achieved for catchment area (A) with the highest values of R 2 (close to 0.8). Still, a relative bias at individual HSs ranges from approximately −60% to 130% in respect of MAF at-site.
In Figure 8, the regression plots and 0.95 confidence intervals for MAF prediction in all CCs are shown, with catchment area as the predictor variable in the regions pooled by the _org and _adj approach. The figure shows three separate sets of regression lines, according to the differences in CCs in these two clustering approaches, taking into account stations.  The selection of an appropriate attribute (independent variable) in the regression equation (14) for MAF prediction is based on the relative bias of MAF estimation via attributes A, Iavg, and Havg for region 1REG. The best agreement of MAF (the smallest relative bias range) is achieved for catchment area (A) with the highest values of R 2 (close to 0.8). Still, a relative bias at individual HSs ranges from approximately −60% to 130% in respect of MAF at-site.
In Figure 8, the regression plots and 0.95 confidence intervals for MAF prediction in all CCs are shown, with catchment area as the predictor variable in the regions pooled by the _org and _adj approach. The figure shows three separate sets of regression lines, according to the differences in CCs in these two clustering approaches, taking into account m j stations. In the course of MAF estimation for a pseudo-ungauged HS, the regression equation in its CC is redefined, i.e., regression coefficients are recalculated from the remaining mj − 1 HSs in the CC. When analyzing pseudo-ungauged HSs from CC5 and partly CC3, the regression line slope becomes illogical, resulting in a MAF decrease with a catchment area increase. CC5 consists of HSs with catchment areas ranging from 630 km 2 to over 1100 km 2 , with karst shares from 0% to 45%. Neglecting HS 46 in the CC3 also yields a negative value of the exponent b, leading to the largest MAF underestimation of all HSs. This is further discussed in Section 4.
A significant MAF relative bias is generally found in HSs with catchment area extremes (min/max) in their CCs but also in the CCs comprising similar catchment area HSs but with significantly different karst shares. The HSs with min/max catchment area within the CC gave good MAF estimation agreement if the CC comprises HSs with similar catchment areas and karst shares. However, there are HSs of similar catchment area(s) and karst share that behave differently. In both _org and _adj CCs, MAF is generally underestimated in catchments with a high karst share. Table 4 shows the consolidated efficiency measures of MAF prediction obtained for all HSs and shown per the regionalization method with the best values bolded; these are shown as components of boxplot diagrams in Figure 9. In the course of MAF estimation for a pseudo-ungauged HS, the regression equation in its CC is redefined, i.e., regression coefficients are recalculated from the remaining m j − 1 HSs in the CC. When analyzing pseudo-ungauged HSs from CC5 and partly CC3, the regression line slope becomes illogical, resulting in a MAF decrease with a catchment area increase. CC5 consists of HSs with catchment areas ranging from 630 km 2 to over 1100 km 2 , with karst shares from 0% to 45%. Neglecting HS 46 in the CC3 also yields a negative value of the exponent b, leading to the largest MAF underestimation of all HSs. This is further discussed in Section 4.
A significant MAF relative bias is generally found in HSs with catchment area extremes (min/max) in their CCs but also in the CCs comprising similar catchment area HSs but with significantly different karst shares. The HSs with min/max catchment area within the CC gave good MAF estimation agreement if the CC comprises HSs with similar catchment areas and karst shares. However, there are HSs of similar catchment area(s) and karst share that behave differently. In both _org and _adj CCs, MAF is generally underestimated in catchments with a high karst share. Table 4 shows the consolidated efficiency measures of MAF prediction obtained for all HSs and shown per the regionalization method with the best values bolded; these are shown as components of boxplot diagrams in Figure 9. The relative bias in MAF estimates by the regression equations at all HSs in the _org and _adj clusters is shown in Figure 10. The dashed red line represents the location of a supposed equal MAF relative bias obtained for an HS in its _org and _adj cluster center. The relative bias in MAF estimates by the regression equations at all HSs in the _org and _adj clusters is shown in Figure 10. The dashed red line represents the location of a supposed equal MAF relative bias obtained for an HS in its _org and _adj cluster center. The relative bias in MAF estimates by the regression equations at all HSs in the _org and _adj clusters is shown in Figure 10. The dashed red line represents the location of a supposed equal MAF relative bias obtained for an HS in its _org and _adj cluster center. The overall impression from Figure 10 is that there are more HSs with better individual results in MAF estimation from _adj clusters (14 HSs) compared to _org clusters (10 HSs), while the remaining 29 HSs (_org = _adj on the red dashed diagonal) lead to an equal MAF relative bias. The most underestimated MAFs (−70% to −50%) in both _org and _adj clusters are found for the HSs with high at-site MAF.

Regional Growth Curves
The selection of the regional growth curve distribution function is based on the Lmoment plot, i.e., on the proximity of regional values for t 3 and t 4 parameters to the line representing a theoretical function. Regional parameter values calculated from m j stations (m j = 53 in 1REG) and for all CCs in the CLUSTER method are shown in Figure 11. For most regions, point (t 3 , t 4 ) lies by the GEV line. Therefore, the GEV function is adopted for all regions.
indicates MAF at-site (m 3 /s), and marker size shows the extent of the MAF difference (%) in the _org and _adj approaches.
The overall impression from Figure 10 is that there are more HSs with better individual results in MAF estimation from _adj clusters (14 HSs) compared to _org clusters (10 HSs), while the remaining 29 HSs (_org = _adj on the red dashed diagonal) lead to an equal MAF relative bias. The most underestimated MAFs (−70% to −50%) in both _org and _adj clusters are found for the HSs with high at-site MAF.

Regional Growth Curves
The selection of the regional growth curve distribution function is based on the Lmoment plot, i.e., on the proximity of regional values for t3 and t4 parameters to the line representing a theoretical function. Regional parameter values calculated from stations ( = 53 in 1REG) and for all CCs in the CLUSTER method are shown in Figure 11. For most regions, point (t3, t4) lies by the GEV line. Therefore, the GEV function is adopted for all regions. Figure 11. Regional values for the LCs = t3 and LCk = t4 parameters, with regions pooled by 1REG (marker +) and by cluster analysis (marker o) in the _org (transparent) and _adj (nontransparent) approaches. All HSs in the CCs are considered.
Regional growth curves are constructed for all regions (e.g., dashed line for region 3 in Figure 12). Each pseudo-ungauged HS has its own regional growth curve because it is estimated from the mj − 1 HS data. Figure 11. Regional values for the LCs = t 3 and LCk = t 4 parameters, with regions pooled by 1REG (marker +) and by cluster analysis (marker o) in the _org (transparent) and _adj (nontransparent) approaches. All HSs in the CCs are considered.
Regional growth curves are constructed for all regions (e.g., dashed line for region 3 in Figure 12). Each pseudo-ungauged HS has its own regional growth curve because it is estimated from the m j − 1 HS data.
The flood information transfer is completed by MAF estimation in an ungauged HS according to its catchment area and the construction of a regional growth curve for its region. Then, for a desired return period T, QT/MAF is known from the regional growth curve, the ratio is multiplied by MAF, and the flood quantile QT is estimated. The flood information transfer is completed by MAF estimation in an ungauged HS according to its catchment area and the construction of a regional growth curve for its region. Then, for a desired return period T, QT/MAF is known from the regional growth curve, the ratio is multiplied by MAF, and the flood quantile QT is estimated.

Regional 100-Year Flood Quantile Estimates
The flood quantile estimates are analyzed for the return period T of 100 years. The results are presented for the ratio q100 = Q100/MAF obtained from the regional growth curves and for flood quantile Q100. The relative bias (%) of q100 and Q100 estimates with respect to at-site values is shown in Figures 13 and 14, respectively, as boxplots. The HSs are located in their corresponding regions/CCs in _org and _adj regionalization, with markers depicting silhouette width and region homogeneity measure H1 (HW test).

Regional 100-Year Flood Quantile Estimates
The flood quantile estimates are analyzed for the return period T of 100 years. The results are presented for the ratio q100 = Q100/MAF obtained from the regional growth curves and for flood quantile Q100. The relative bias (%) of q100 and Q100 estimates with respect to at-site values is shown in Figures 13 and 14, respectively, as boxplots. The HSs are located in their corresponding regions/CCs in _org and _adj regionalization, with markers depicting silhouette width and region homogeneity measure H1 (HW test). , and a square marker shows that the HS is found to be discordant in its region. HS numbers are by the markers. The marker's color indicates silhouette width, its size indicates whether the q100 is estimated from a homogeneous, possibly homogenous, or heterogenous region-H1 class (small, medium, and large marker, respectively), and a square marker shows that the HS is found to be discordant in its region. HS numbers are by the markers. Figure 13. Relative bias (%) in q100 estimates in (a) _org CCs and (b) _adj CCs. The marker's color indicates silhouette width, its size indicates whether the q100 is estimated from a homogeneous, possibly homogenous, or heterogenous region-H1 class (small, medium, and large marker, respectively), and a square marker shows that the HS is found to be discordant in its region. HS numbers are by the markers.

Figure 14.
Relative bias (%) in Q100 estimates in (a) _org CCs and (b) _adj CCs. The marker's color indicates silhouette width, its size indicates whether the Q100 is estimated from a homogeneous, possibly homogenous, or heterogenous region-H1 class (small, medium, and large marker, respectively), and a square marker shows that the HS is found to be discordant in its CC. HS numbers are by the markers. The marker's color indicates silhouette width, its size indicates whether the Q100 is estimated from a homogeneous, possibly homogenous, or heterogenous region-H1 class (small, medium, and large marker, respectively), and a square marker shows that the HS is found to be discordant in its CC. HS numbers are by the markers.
The range of relative bias in the q100 estimates in Figure 13 is smaller in homogeneous and possibly homogeneous regions (CC3, CC4, and CC2 for _adj), compared to the range in definitively heterogeneous regions (CC1, CC5, CC6, and CC2 for _org). The most obvious range shrinkage is in CC2 in _adj clustering (Figure 13b) compared to _org clustering (Figure 13a). A situation in Figure 14 depicting Q100 estimates relative to bias shows that after adjustment, definitely heterogeneous regions of CC1 and CC5 have a smaller range/more compact Q100 results compared to other CCs (Figure 14a compared to Figure 14b).
In Q100 estimation, apart from q100, MAF estimation plays an important role. The overall efficiency results of silhouette width _org and _adj clustering achieved for q100 and Q100 estimates is given in Table 5. Almost all considered measures are slightly better in _adj compared to _org clustering and are shown in bold. Table 5. Relative Bias (%) in regional 100-year flood quantile estimation for silhouette-width-induced clustering. Regional RMSE (RRMSE) is also shown. The best results are bolded. At the individual HS level, there are HSs with significant silhouette width but poor Q100 estimates, e.g., HS 29 in CC6 and HS 18 in CC4.

Discussion
There are two approaches in the comparison of regionalization methods: the first approach ends with the definition of homogeneous regions, and the second approach continues through information transfer down to the results of regionally assessed values [55]. The second approach is a performance-based assessment, conducted here with a focus on the silhouette-width-induced clustering outcome in flood quantile estimates. This approach has its known drawbacks due to uncertainty accumulation as the assessment advances (e.g., [17,55]).
Clustering based on silhouette width has so far been performed in other disciplines on large data samples, resulting in fewer CCs/regions compared to the number used here for flood quantile estimation [40].
Marková et al. [56] found 3 regions for flood seasonality in 556 Slovak and Austrian catchments where the optimal number of clusters was found according to the mean silhouette width (MSW). It was not shown whether these regions are homogenous.
The choice between an optimal number of regions here is also made according to maximized MSW, between possibly 3 and 7 ( Figure 5), while trying to achieve the recommended number of elements, i.e., HSs in a region for hydrological applications [17,23]. Table 6 shows realized MSWs in six regions and their size. According to the Rousseeuw's MSW range [38], region/CC2_adj has a strong clustering structure (0.51 < MSW < 0.70), while the rest of the regions have weak clustering structures (0.26 < MSW < 0.50). The sizes of the individual regions are between the recommended 5 and 20.
The region size plays an important role in homogeneity determination [17]. Small regions may lead to false homogeneity conclusions, large regions are rarely homogenous, and comparison among regions with different sizes is complicated due to the 'regional homogeneity' concept, which is different from the heterogeneity concept considered in other fields [55]. Therefore, Requena et al. [55] proposed using the Gini index (GI) as a regional homogeneity (heterogeneity) measure that overpowers HW and AD tests in many aspects. In further research, GI will be used because the results obtained for pseudoungauged HSs here have shown that good Q100 estimation results originate from both homogeneous and heterogeneous regions. In the _adj CCs, the conclusions of the HW and AD bootstrap tests are more consistent compared to the _org CCs (Table 3), and the efficiency measures of MAF (Table 5), q100, and Q100 (Table 6) are overall better, although some values are significantly underestimated or overestimated.
When strictly observing results obtained from the homogenous region CC2_adj, which also has the best MSW among all CCs and a balanced size, the range of percent bias in Q100 is −50% to 40%, with one high outlier of 100%. The resulting bias range in q100 is from-24% to 30%. Such results point to the poor performance of the catchment area as an attribute used in the flood transfer model.
A single regression was used as a transfer model, with the catchment area as an independent variable. Despite significant values of R 2 , large percent biases in MAF were observed in the subsequent jackknife procedure. It is assumed that this simple regression model is not adequate and that new variables should be introduced. The selection of new variables is not straightforward [57,58]. In a recent performance-based regionalization study for Ethiopia [59], 27 catchment attributes were considered: 9 morphologic and location, 5 soil, 12 land use, and mean annual rainfall. In the three homogeneous regions out of the four defined, MAF regression equations were developed using a total of six attributes: three morphologic, two land use, and one soil. Per each region, the MAF relative errors were 51%, 4%, and 17%. The flood quantile comparison was not shown. De Michele and Rosso [60] have shown a multi-level approach to flood frequency regionalization based on physical and statistical criteria/attributes. Flood, streamflow, and rainfall seasonality were studied and two indices were used to cluster catchments with the same flood production mechanism. In the area of Northwestern Italy, similar to the one studied here, 80 HSs were used, and 4 regions were defined according to seasonality analysis. It appears that precipitation and flood characteristics are unavoidable candidates for further attribute inclusion in regionalization. Additionally, in FFA, it is assumed that the upper limit of the Cs of flood flow is limited by the Cs of precipitation [61]. This means that not only can precipitation data be used as an attribute, but they could explain suspicious values of regionally assessed flood quantiles as well. Sometimes, in such cases, the value of the distribution parameters is limited to acceptable values [62].
The geological structure of the 53 catchments considered here is characterized by a wide range of karst content [34]. It is known that karst potentially reduces or amplifies flood peaks and influences the catchment boundary due to the complex conditions for surface and groundwater flow [63,64]. A karst share influence on MAF prediction is implicitly shown in Figure 9, where it may be observed that MAF in catchments with high karst share (>50%) is generally underestimated.
Apart from catchment attribute selection in MAF regression, the region size influences the reliability of regression coefficient estimation. It is shown here that small regions with five HSs may lead to ill-posed regression lines, such as in CC5 (Figure 8). Mc Cuen [65] suggested four times more data than the number of parameters for reliable correlation estimation. In this regard, a simple regression model with two parameters requires eight data points (HSs), plus one for the (pseudo-) ungauged HS, meaning that nine HSs is the minimum size of a region. From this perspective, CC_adj 1, 2, and 6 and CC_org 2 have reliable MAF regression equations. In the regional flood growth curve estimations, the GEV distribution function is used for all regions. According to the L-moment ratio plots (Figure 11), for three CCs (_org 5, _adj 5, and _adj 3) other functions fit better but the relative biases in q100 estimates are similar to other CCs ( Figure 13). The adoption of different regional distribution functions in regions has been performed by other researchers (e.g., [59]), and this may contribute to result improvement in future research. Due to input data issues (e.g., data gaps), the use of LP3 with the expected moments algorithm (EMA) should be reconsidered [11]. Overall better Q100 estimates were achieved in previous research [33] conducted using the Hydrologic Engineering Center's Statistical Software Package (HEC-SSP) where EMA is installed. In addition to the 53 HSs here, more HSs were included, and a study area comprising 74 HSs was delineated in the three regions. The study period was 1961-1990.
The silhouette width of an individual HS has been shown as a good indicator of belonging to the wrong region but not as an indication of q100 or Q100 estimate quality in the circumstances presented here. The silhouette widths in HS 16 and HS 52 are among those showing a high clustering structure (>0.50). When these HSs are considered pseudo-ungauged in the jackknife procedure, the region is homogeneous (Figure 7), but their q100 estimates are among the most overestimated (~50%) (Figure 13). The percent biases in q100 or Q100 estimates in the region of CC_adj 2, holding most of the highest achieved silhouette widths among HSs, are slightly less spread compared to other CCs (Figures 13 and 14). Still, the HSs with the highest silhouette widths are not those with the best q100 or Q100 estimates.

Conclusions
This study investigated the potential of silhouette-width-induced clustering according to catchment morphology similarity on the improvement of regionally based flood quantile estimates in 53 catchments situated in Bosnia and Herzegovina and Serbia. The ten HSs from Bosnia and Herzegovina are characterized by flood peak dataset gaps and the presence of karst, which are among the main issues in flood quantile estimation in practice. The HSs from Serbia were added for flood information improvement in the catchments from Bosnia and Herzegovina. The flood information transfer was performed for all catchments (HSs) using the jackknife procedure, thus showing percent bias with respect to 'true' values at each computational step. Flood quantile estimation from both homogeneous and heterogeneous regions allowed for the comparison of the extent of bias even when the flood estimation conditions were not fully met. The following is concluded: The mean silhouette width of a cluster is an indicator of region composition quality: the higher mean silhouette width of a region means less uncertainty in region homogeneity. The regions should be adjusted to comprise HSs with positive individual silhouette widths. However, high individual silhouette widths cannot indicate good performance in flood quantile estimation.

2.
Balancing region size is important. Maximizing the mean silhouette width of a region yields a small number of regions with a large number of HSs. On the other hand, large regions tend to be heterogeneous. The minimum number of HSs in the region depends on the selected information transfer method. When the regression equation is established for the assessment of mean annual flood, the minimum number of HSs in a region should be nine if a simple regression with one variable is planned, and four more HSs should be added per regression coefficient/independent variable. 3.
Despite the high values of R 2 , single regression with catchment area as an independent variable gives significant relative bias in MAF.

4.
Similarity based on morphological attributes alone tends to result in regions that provide unreliable flood quantile estimates, even if the region is homogeneous, most likely due to an absence of local climate-/flood-related attributes.

5.
Good flood estimation results at individual HSs are achieved from both homogeneous and heterogeneous regions and for those containing discordant HSs. In addition to or instead of HW and AD bootstrap tests, an additional regional homogeneity insight should be introduced in future work by, e.g., the Gini Index (GI), to increase certainty about the results obtained from homogeneous regions. 6.
The use of one distribution function (GEV) for all regions in the index-flood method does not seem to significantly influence flood quantile estimates in this study.