Is Clustering Time-Series Water Depth Useful? An Exploratory Study for Flooding Detection in Urban Drainage Systems

Abstract: As sensor measurements emerge in urban water systems, data-driven unsupervised machine learning algorithms have recently drawn tremendous interest in event detection and in hydraulic water level and flow prediction. However, most of them are applied in water distribution systems, and few studies consider using unsupervised cluster analysis to group time-series hydraulic-hydrologic data in stormwater urban drainage systems. To improve the understanding of how cluster analysis contributes to flooding location detection, this study compared the performance of K-means clustering, agglomerative clustering, and spectral clustering in uncovering time-series water depth dissimilarity. In this work, the water depth datasets are simulated by an urban drainage model and then formatted for a clustering problem. Three standard performance evaluation metrics, namely the silhouette coefficient index (SCI), the Calinski-Harabasz index (CHI), and the Davies-Bouldin index (DBI), are employed to assess the clustering performance in flooding detection under various storms. The results show that the SCI and DBI are more suitable for assessing the performance of K-means and agglomerative clustering, while the CHI only works for spectral clustering, indicating that these clustering algorithms are metric-dependent flooding indicators. The results also reveal that agglomerative clustering performs better in detecting short-duration events, while K-means and spectral clustering behave better in detecting long-duration floods. The findings of these investigations can be employed in urban stormwater flood detection at specific junction-level sites by using anomalous changes in the water level of correlated clusters as an early flood warning for the local neighborhoods.


Introduction
Urban drainage systems (UDSs) are the infrastructures constructed to provide conveyance ability and storage capability for drainage overflow mitigation, surface inundation reduction, and pollutant removal. However, the existing UDSs, whose functionality can only serve for a limited number of years, might degrade and even deteriorate as time goes by [1]. In recent years, retrofitting the traditional UDSs with water-level sensors, velocity meters, and flow sensors has been widely adopted as an adaptive and cost-effective solution for flooding challenges [2,3]. The deployed sensors can measure water quantity and quality data in real time, which now makes it feasible for decision-makers and stakeholders to foresee potential flood events and locate the vulnerable sites. In this study, clustering algorithms, including K-means clustering (KC), agglomerative clustering (AC), and spectral clustering (SC), are applied to urban flood tracking. A storm water management model (SWMM) is established to represent a real-world stormwater urban drainage system located in the Sugar House neighborhood, Salt Lake City, UT, USA. Three evaluation indices, namely the silhouette coefficient index (SCI), the Calinski-Harabasz index (CHI), and the Davies-Bouldin index (DBI), are used to assess the performance of the clustering algorithms. The whole research is driven by the hypothesis that clustering time-series water level data has the potential to facilitate flooding location detection in the Sugar House area. The investigations provide answers to several inter-related research questions: (1) What is the performance of different clustering algorithms in capturing the floods? (2) Which metrics are most suitable for assessing cluster model performance based on hydraulic-hydrologic data in UDSs? (3) Which features of flood time-series data (length, volume, and variability) are most influential for flooding detection, and how does the choice of data feature affect the clustering performance in localizing the flooding sites?
To answer these questions, it is necessary to explore how UMLAs group time-series water depth data and which assessment score best represents UMLA performance. However, challenges in implementing UMLAs with time-series data still exist. Firstly, it is essential to re-format the time-series water depth datasets to make them suitable for a clustering problem. This difficulty is associated with the second research question above, since the features of the datasets determine how the data frame is re-structured [39]. Secondly, the connection between the number of clusters and the clustering model performance is another obstacle. As it is still unknown how to correlate clustering performance with the number of clusters in stormwater systems, it is necessary to build such a theoretical relationship for a practical application like the flooding detection herein [40]. Therefore, this study aims to improve the understanding of how UMLAs facilitate detecting hydraulic anomalies according to the characteristics of water depth datasets in urban drainage networks.
The layout of the study is as follows: (1) build KC, AC, and SC algorithms to group the time-series water depth data; (2) use UMLA metrics, including SCI, CHI, and DBI, to evaluate these algorithms; (3) compare the best number of clusters obtained by each method; (4) investigate the relationship between model performance of flooding detection and water depth data characteristics (see Figure 1 for details). We start by describing the implementation of different UMLA methods, followed by the research methodology with an overview of the real-world case study, performance metrics, and simulation scenarios for cluster analysis. Then, we present the results, discussion, and finally, the conclusions.

Materials and Methods
This study was organized in four steps: (i) time-series data preprocessing; (ii) clustering modeling implementation; (iii) clustering performance assessment; (iv) applications analysis of clustering results for urban floods detection. The workflow of the methods can be found in Figure 1.


Description of Unsupervised Machine Learning Algorithms
Current machine learning techniques mainly fall into two groups: supervised and unsupervised learning [41]. UMLAs are self-organizing methods for finding patterns in unlabeled data. Cluster analysis, a subset of UMLA methods, is in general based on the principle of grouping similar observations and separating dissimilar observations [42]. Anomalous data points that differ from the others may then be filtered out [43]. A large number of clustering algorithms exist, including K-means, affinity propagation, and mean shift. In this research, we employed the SCI, CHI, and DBI to assess the clustering performance because of their accuracy and wide applicability in similar types of studies [44-46].

K-Means Clustering
K-means clustering (KC) is a centroid-based unsupervised clustering algorithm, originally designed for signal processing. It is the most widely applied method of cluster analysis in data mining [33]. K-means aims to partition the inputs into k clusters. Given a set of observations (x_1, x_2, ..., x_n) with p variables each, the algorithm runs as follows: (1) Choose k initial centroids, each defined by a value for each of the p variables; these are often chosen randomly, e.g., by simply picking k observations. (2) Assign each observation to the centroid it is most similar to; similarity is generally measured as the Euclidean distance between the observation and the centroid in parameter space. (3) Once all observations are assigned, re-estimate each centroid's location as the mean of the p variables over all observations assigned to it. (4) Repeat until the algorithm stabilizes (i.e., the within-cluster sum of squares is minimized).
The goal then is to minimize the within-cluster sum of squares:

\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2    (1)

where k is the number of cluster centers, C_i is the set of observations assigned to cluster i, and \mu_i, i = 1, ..., k, are the cluster centroids. The total intra-cluster distance is the total squared Euclidean distance from each point to the center of its cluster, and it measures the variance, or internal coherence, of the clusters [47]. It can be used to assess the stability of the solution: when its decrease falls below a predefined threshold, the algorithm stops. The algorithm is often run multiple times with different random initializations of the cluster centroids to avoid convergence to sub-optimal solutions, and the clustering solution with the lowest sum of squares is chosen as the final output. However, the choice of k is challenging when model performance metrics are not available. Often, an initial value of k is chosen, and the algorithm is then repeated for higher and lower values. To improve the efficiency of discovering the best k value, a score-based (SCI, CHI, DBI) performance assessment method is recommended in many prior studies [42].
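The procedure above can be sketched as follows. This is a minimal illustration, not the study's own implementation: the data are synthetic stand-ins for the junction feature vectors, and scikit-learn's KMeans is assumed as the solver.

```python
# Minimal K-means sketch: n_init re-runs the algorithm with different random
# centroids and keeps the solution with the lowest within-cluster sum of
# squares (Equation (1)), exposed as inertia_.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two artificial groups of observations with p = 2 variables each
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(3.0, 0.3, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_    # cluster assignment per observation
wcss = km.inertia_     # within-cluster sum of squares
```

Repeating the fit over a range of k values and comparing the resulting scores is the basis of the assessment method discussed above.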

Agglomerative Clustering
Agglomerative clustering (AC) is one of the main forms of hierarchical clustering. These algorithms do not provide a single partitioning of the data but instead provide a full hierarchy of cluster solutions, from all observations in a single cluster (i.e., k = 1) to each observation in its own cluster (i.e., k = n) [48]. In contrast to KC, hierarchical methods allow existing clusters to be split or merged, so that smaller clusters are related to larger clusters in a hierarchy. The rules governing which clusters are merged are again based on their distance or similarity. The AC algorithm consists of the following steps: (1) Start with each data point as its own cluster.
(2) Select the distance metric and linkage criteria to calculate the dissimilarity between pairs of observations. (3) Link together the two clusters with the minimum dissimilarity. (4) Continue this process until there is only one cluster.
A key decision in the AC algorithm is the calculation of dissimilarity between clusters. In this study, we used the Euclidean distance [47] and the Ward linkage, which measures the distance between the cluster centroids, similar to the K-means clustering method. The Euclidean distance and Ward linkage are defined by Equations (2) and (3), respectively:

d(a, b) = \sqrt{ \sum_{i=1}^{n} (a_i - b_i)^2 }    (2)

where a and b are Euclidean vectors, a_i and b_i are the i-th components of a and b, and n is the number of components;

d_{ij} = \| X_i - X_j \|^2    (3)

where d_ij is the squared Euclidean distance between points i and j, and X_i and X_j are the vectors being merged under Ward's criterion. The resulting hierarchy of clusters can be represented using a dendrogram plot [48]. A detailed introduction to the dendrogram plot can be found in Section 2.3.5 below.
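Steps (1)-(4) above can be sketched with scikit-learn's AgglomerativeClustering, using the same Euclidean distance and Ward linkage. The data here are synthetic stand-ins, not the study's water depth series.

```python
# Agglomerative clustering with Ward linkage: at each step the pair of
# clusters whose merge gives the smallest increase in total within-cluster
# variance is linked, until the requested number of clusters remains.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)),
               rng.normal(4.0, 0.2, (20, 2))])

# linkage="ward" requires Euclidean distances between observations,
# which is scikit-learn's default metric
ac = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
labels = ac.labels_
```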

Spectral Clustering
Spectral clustering (SC) is an unsupervised learning technique based on graph theory, where SC exploits the spectrum of a graph built from the data to find the clusters [49]. Unlike the previous methods, which tend to form clusters by proximity, SC aims to identify observations that are linked, and which therefore may not form classical spherical groups in parameter space. The SC algorithm is as follows: (1) Create a similarity matrix S between observations. This is the complement of the dissimilarity matrices used in other methods, and here it is calculated as the negative Euclidean distance. (2) Create an adjacency matrix A representing the graph, or connectivity, between observations. This is a transformation of S: for each observation, we find the k nearest neighbors (i.e., those with the highest similarity). If observations i and j are neighbors, we set A_ij = S_ij; if not, we set A_ij = 0. (3) Create a degree matrix D, a diagonal matrix whose entries are the degree of connectivity of each observation:

D_{ii} = \sum_{j=1}^{n} A_{ij},  i = 1, 2, ..., n

(4) Calculate the graph Laplacian matrix L, which can be normalized or unnormalized; here, we use the unnormalized form, L = D − A. (5) Find the clustering solution by eigendecomposition of the Laplacian, selecting the eigenvectors of the k smallest eigenvalues; in the ideal case of k connected components, these eigenvectors separate the observations exactly. K-means is then run on these eigenvectors to obtain the final cluster assignment of each observation. As SC performs dimensionality reduction before clustering the data points, it is a very flexible approach for complex data sets. However, the similarity matrix used by SC may include negative values, which can be problematic for grouping time-series points.
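Steps (1)-(5) can be sketched directly in numpy. Note one deliberate swap, flagged in the code: a Gaussian similarity replaces the negative-Euclidean similarity, precisely because of the negative-value problem the text notes; the data are synthetic stand-ins.

```python
# Manual spectral clustering sketch: kNN adjacency, degree matrix,
# unnormalized Laplacian L = D - A, then K-means on the eigenvectors
# of the smallest eigenvalues.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, (15, 2)),
               rng.normal(5.0, 0.2, (15, 2))])
n = len(X)

# (1) pairwise distances; a Gaussian similarity is swapped in here so that
#     all edge weights are non-negative (assumption, see lead-in)
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
S = np.exp(-dist**2 / 2.0)

# (2) adjacency A: keep each point's k nearest neighbours, symmetrized
k_nn = 5
A = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(S[i])[::-1][1:k_nn + 1]  # most similar, excluding self
    A[i, nbrs] = S[i, nbrs]
A = np.maximum(A, A.T)

# (3) degree matrix D and (4) unnormalized Laplacian L = D - A
Dm = np.diag(A.sum(axis=1))
L = Dm - A

# (5) eigenvectors of the 2 smallest eigenvalues, then K-means on them
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, :2]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
```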

Summary and Comparison of Clustering Algorithms
In general, it is difficult to recommend a single algorithm as the most suitable for clustering, particularly with data that are uncertain and of poor quality, such as the pipe flow or water level data used here [41]. It is, therefore, advisable to use several algorithms and compare their performance for specific applications. Here, we use KC, SC, and AC to discover the unknown subgroups in the simulated water depth data of the UDSs' junctions. Table 1 summarizes the advantages and disadvantages of these algorithms, drawn from review papers [24,33,44].

Table 1. Clustering algorithm information summary.

K-means Clustering
A kind of vector quantization that partitions data points into clusters by minimizing the intra-cluster distance.

Agglomerative Clustering
A kind of hierarchical clustering that merges clusters according to a measure of data dissimilarity.

Spectral Clustering
A kind of graph clustering based on the distances between points.

Clustering Model Implementation
The SWMM model was run six times, once with each of the rainfall scenarios described above. We collected the simulated time-series water depth from each node in the stormwater drainage network for cluster analysis. As there are 60 junctions in the SWMM model, this results in a matrix where each column represents a single 5-min time step and each row (60 rows) stands for a junction, or node, in the network. We then used principal component analysis (PCA) to reduce the dimensionality of this matrix. PCA uses the eigendecomposition of the correlation matrix to identify a small set of principal components that represent the majority of the variance in the original data [50]. Here, we used the correlations between the time series at different nodes to reduce the matrix to 2 columns, meaning the timesteps are compressed to 2 principal components. Finally, the dataset matrix is configured with 60 rows and 2 columns under each modeling scenario. The datasets used in this work are not large, so the computational costs are limited. While other techniques for data reduction exist (e.g., correspondence analysis, factor analysis, or non-metric multi-dimensional scaling), we used PCA because of the assumed linear response of the water depth values. Although the reduction of dimensionality might cause information loss or undesirable relationships between score axes, PCA helps reduce computation time and remove redundant data features in the subsequent cluster analysis.
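The reduction step can be sketched as below. The depth matrix is synthetic (the time-step count of 288 and the per-junction scaling are assumptions for illustration); only the shapes follow the description above, with 60 junction rows compressed to 2 principal-component columns.

```python
# PCA sketch: a (60 junctions x T timesteps) water depth matrix is reduced
# to a (60 x 2) score matrix used as input for the cluster analysis.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
T = 288  # e.g. 24 h at a 5-min interval (assumed for illustration)
base = np.sin(np.linspace(0, 3 * np.pi, T))
# 60 junctions: each row is one junction's depth series (shared pattern + noise)
depths = np.vstack([b * base + rng.normal(0, 0.05, T)
                    for b in rng.uniform(0.5, 2.0, 60)])

pca = PCA(n_components=2)
scores = pca.fit_transform(depths)   # shape (60, 2)
```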
All clustering algorithms were then run using the set of two principal components shown in Figure 2, with the following set-up: (1) K-means: we initially set the number of clusters (k) to 2 for each modeling scenario. The algorithm was repeated ten times with different random initializations, and a maximum of 5 iterations was used to converge the algorithm.
(2) Agglomerative clustering: we used Ward linkage, as it is robust to outliers and unequal variance in the data. Because only the Euclidean distance supports Ward linkage computation, the Euclidean distance is the appropriate way to measure the data dissimilarity [51]. Thus, the cluster distance calculation method and the dissimilarity metric among sample points were set to Ward linkage and Euclidean distance, respectively. The resulting hierarchy was cut to provide 2 clusters. (3) Spectral clustering: the algorithm was used to identify 2 clusters, using the unnormalized graph Laplacian.
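The three configurations above can be sketched together as follows. The (60 x 2) principal-component matrix is a synthetic stand-in (five points are offset to mimic anomalous junctions), and scikit-learn's estimators are assumed as the implementations.

```python
# The three clustering set-ups, each asked for k = 2 clusters on a
# stand-in (60 x 2) principal-component matrix.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

rng = np.random.default_rng(4)
pcs = np.vstack([rng.normal(0.0, 0.3, (55, 2)),
                 rng.normal(4.0, 0.3, (5, 2))])   # 5 "anomalous" junctions

km_labels = KMeans(n_clusters=2, n_init=10, max_iter=5,
                   random_state=0).fit_predict(pcs)
ac_labels = AgglomerativeClustering(n_clusters=2,
                                    linkage="ward").fit_predict(pcs)
sc_labels = SpectralClustering(n_clusters=2,
                               random_state=0).fit_predict(pcs)
```

Under this set-up, the small offset group separates from the bulk of the junctions, which is the behaviour exploited for pre-screening flooded sites in the next paragraph.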
In Figure 2 below, there is no marginal overlap between samples, which indicates that the cluster classification is reasonable with respect to grouping the time-series water level data. Additionally, the isolated dots in the subplots of Figure 2 reflect the dissimilarity of the water depth datasets under this event, indicating that these isolated dots might be potential flooded junctions; this helps decision-makers pre-screen the vulnerable sites in the drainage networks.

Clustering Model Evaluation and Validation
Unlike supervised machine learning algorithms, which compare predicted and actual values to compute model accuracy, UMLAs assess performance directly on the characteristics of the clusters that were obtained. The performance then depends on the data features selected, the data preprocessing, and parameter settings such as the distance function, a density threshold, or the number of expected clusters, which can be modified according to the varying datasets and inputs. As a result, there is rarely a single obvious cluster solution, and cluster analysis is an iterative process of knowledge discovery, or interactive multi-objective optimization, that involves trial and error, aimed at obtaining the desired results [52-55].
Several indices, including SCI, DBI and CHI, are employed to measure the relative performance of clustering algorithms. In general, these metrics provide an assessment of how the data variance is partitioned. An ideal cluster solution will have low intra-cluster variance (i.e., all observations should be similar within a cluster) and high inter-cluster variance (the clusters should be well separated).
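The three indices can be computed directly from a dataset and its labels, as in the sketch below; the data and clustering are synthetic stand-ins, with scikit-learn's metric functions assumed.

```python
# The three evaluation indices applied to one clustering result.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(3.0, 0.3, (30, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sci = silhouette_score(X, labels)         # higher is better, in [-1, 1]
chi = calinski_harabasz_score(X, labels)  # higher is better
dbi = davies_bouldin_score(X, labels)     # lower is better, >= 0
```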

Silhouette Coefficient Index
The silhouette coefficient index is an example of model self-evaluation, where a higher SCI score corresponds to a model with better-defined clusters [56]. The score is bounded between −1 for incorrect clustering and +1 for well-formed clusters; scores around zero indicate overlapping clusters. The SCI is defined for each observation and calculated as Equation (4):

SCI = (n - m) / \max(m, n)    (4)

where the SCI is for a single observation; m is the mean distance between an observation and all other observations in the same cluster; and n is the mean distance between the same observation and all observations in the next nearest cluster. The SCI has the advantage that it can be used to examine how well individual observations are clustered, and an estimate for each cluster or for the whole solution can be obtained by averaging across a cluster or the entire dataset, respectively. The mean SCI over a set of samples is relatively high when clusters are dense and well separated [57].
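Equation (4) can be checked numerically for a single observation and compared against scikit-learn's per-sample silhouette values; the two labelled blobs here are synthetic.

```python
# Manual SCI for one observation: m is the mean distance within its own
# cluster (excluding itself), n the mean distance to the other cluster.
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.4, (10, 2)),
               rng.normal(3.0, 0.4, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)

dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
i = 0                                  # an observation in cluster 0
same = (labels == 0) & (np.arange(20) != i)
m = dist[i, same].mean()               # mean intra-cluster distance
n_ = dist[i, labels == 1].mean()       # mean distance to nearest other cluster
sci_i = (n_ - m) / max(m, n_)

ref = silhouette_samples(X, labels)[i]  # scikit-learn's value for the same point
```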

Calinski-Harabasz Index
The CHI is calculated as the ratio of the between-cluster dispersion and the within-cluster dispersion [58], penalized by the number of clusters (k). A higher CHI score indicates better-defined clusters (i.e., dense and well separated). The CHI for a set of k clusters is calculated as:

CHI = \frac{Tr(B_k)}{Tr(W_k)} \times \frac{N - k}{k - 1}    (5)

where N is the number of points in the data; k is the number of clusters; Tr denotes the trace of a dispersion matrix; B_k is the between-group dispersion matrix; and W_k is the within-cluster dispersion matrix. B_k and W_k are defined by the following equations:

W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T    (6)

B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^T    (7)

where C_q is the set of points in cluster q; c_q is the center of cluster q; c is the center of the whole dataset, which has been clustered into k clusters; and n_q is the number of points in cluster q.
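Equations (5)-(7) can be checked numerically against scikit-learn's implementation; the data and labels below are synthetic stand-ins.

```python
# Manual CHI: traces of the within- and between-group dispersion matrices,
# scaled by (N - k)/(k - 1), compared to calinski_harabasz_score.
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.5, (12, 2)),
               rng.normal(2.5, 0.5, (12, 2))])
labels = np.array([0] * 12 + [1] * 12)

N, k = len(X), 2
c = X.mean(axis=0)                           # overall centroid
tr_W = tr_B = 0.0
for q in range(k):
    Xq = X[labels == q]
    cq = Xq.mean(axis=0)
    tr_W += ((Xq - cq) ** 2).sum()           # trace of W_k (Equation (6))
    tr_B += len(Xq) * ((cq - c) ** 2).sum()  # trace of B_k (Equation (7))
chi = (tr_B / tr_W) * (N - k) / (k - 1)      # Equation (5)

ref = calinski_harabasz_score(X, labels)
```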

Davies-Bouldin Index
The DBI can also be used to evaluate the model, where a lower DBI corresponds to a model with better separation between the clusters [59]. The index is defined as the average similarity (R_ij) between each cluster i and its closest (i.e., most similar) cluster j. The DBI is calculated as Equation (8):

DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij}    (8)

where DBI is the Davies-Bouldin index; zero is the lowest possible score, and values closer to zero indicate a better partition; and k is the number of clusters. R_ij is the similarity measure, defined by Equation (9):

R_{ij} = \frac{s_i + s_j}{d_{ij}}    (9)

where s_i is the average distance between each point of cluster i and the centroid of that cluster, representing the cluster diameter; d_ij is the inter-cluster distance between centroids i and j; and R_ij trades off inter-cluster distance against intra-cluster distance. The computation of the DBI is simpler than that of the SCI since this index is computed only with quantities and features inherent to the dataset [60]. However, a good DBI value might not imply the best information retrieval [55].
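Equations (8)-(9) can likewise be checked against scikit-learn, here on three synthetic clusters.

```python
# Manual DBI: cluster diameters s_i, pairwise similarity R_ij, then the
# average of each cluster's worst-case R_ij, compared to davies_bouldin_score.
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
               rng.normal(2.0, 0.5, (10, 2)),
               rng.normal(5.0, 0.5, (10, 2))])
labels = np.repeat([0, 1, 2], 10)

cents = np.array([X[labels == q].mean(axis=0) for q in range(3)])
# s_i: average distance of each cluster's points to its centroid
s = np.array([np.linalg.norm(X[labels == q] - cents[q], axis=1).mean()
              for q in range(3)])
dbi = 0.0
for i in range(3):
    r = [(s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])  # Equation (9)
         for j in range(3) if j != i]
    dbi += max(r)          # R_ij for the most similar cluster (Equation (8))
dbi /= 3

ref = davies_bouldin_score(X, labels)
```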

Intra-Cluster Distance
Intra-cluster distance (ICD) is the distance between two samples belonging to the same cluster. Three types of intra-cluster distance, namely the complete diameter distance, the average diameter distance, and the centroid diameter distance, are popular in prior studies. As the number of clusters increases, individual clusters become more homogeneous and the ICD decreases. At a certain point, the decrease in distances becomes negligible. Plotting this distance against k usually yields an inflection, or elbow, point where this occurs, which can be used to identify the optimal value of k [61]. The number of clusters is chosen at this point, hence the "elbow criterion." Here we use the centroid diameter distance to represent the ICD, given as double the average distance between the objects of a cluster and its centroid:

\Delta(S) = 2 \left[ \frac{\sum_{x \in S} d(x, T)}{|S|} \right]

where ∆(S) is the centroid diameter distance of the formed cluster S; x denotes the samples belonging to cluster S; d(x, T) is the distance between object x and the cluster centroid T; and |S| is the number of objects in cluster S.
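The elbow criterion can be sketched as follows, using the K-means within-cluster sum of squares as the distance measure that shrinks with k; the three-blob data are synthetic, so the true elbow is known to sit at k = 3.

```python
# Elbow criterion sketch: record the within-cluster sum of squares for
# k = 1..8 and look at where the per-step decrease becomes negligible.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(c, 0.2, (20, 2)) for c in (0.0, 3.0, 6.0)])

wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 9)]
# decrease from k to k+1; the elbow is where this flattens out
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
```

Plotting wcss against k would show the inflection described above; here the large drop up to k = 3 and the negligible drops afterwards identify the elbow.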

Dendrogram
A dendrogram is a visualization in the form of a tree that shows the hierarchical relationships, such as the order and distance (dissimilarity) between samples [62]. The individual samples are located along the bottom of the dendrogram and are referred to as leaf nodes. The hierarchical clusters are formed by merging individual samples or existing lower-level clusters. In a dendrogram, the vertical axis is labeled distance and refers to a dissimilarity measure between individual samples or clusters. Generally, horizontal lines can be regarded as places where clusters merge, while vertical lines show the distance at which lower-level clusters were merged to form a new higher-level cluster. The dissimilarity measure between two groups is calculated as Equation (12):

Dis = 1 - C    (12)

where Dis is the dissimilarity, or distance, among objects and C is the degree of correlation between clusters. If clusters are highly correlated with each other, their correlation value is close to 1, so Dis = 1 − C is close to zero; therefore, highly related clusters are nearer to the bottom of the dendrogram. Clusters that are not correlated have a correlation value close to zero, and clusters that are negatively correlated give a distance value larger than 1 in the dendrogram. The dendrogram can be used to visually allocate correlated objects to clusters or to detect outliers and anomalies in a diagram [47]. Each sample starts as a single cluster, and pairs of clusters are then successively combined until all clusters have been merged into one; the dendrogram shows how these aggregations are performed from bottom to top. This procedure allows cut-off points to flexibly and efficiently represent the number of clusters. Therefore, this study used the cut-off points in the dendrogram to validate the cluster number of the agglomerative clustering.
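Building the hierarchy and reading a cluster count from a cut-off can be sketched with scipy; the data are synthetic, and scipy's linkage/fcluster stand in for the dendrogram plot itself.

```python
# Ward merge history and a dendrogram cut: each row of Z records one merge
# (cluster a, cluster b, merge distance, new cluster size); cutting between
# the last two merge heights leaves two clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0.0, 0.2, (10, 2)),
               rng.normal(4.0, 0.2, (10, 2))])

Z = linkage(X, method="ward")
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree from Z

cut = (Z[-1, 2] + Z[-2, 2]) / 2          # a height between the last two merges
labels = fcluster(Z, t=cut, criterion="distance")
```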

Study Area and Data Description
A real-world urban stormwater system located in Salt Lake City, UT, USA, was selected as the case study, shown in Figure 3. The study area, covering 81 ha, is semi-arid and has soil composed of four primary types: alluvial fan, artificial fill, silt and clay, and sand and gravel deposits. The soil surrounding the study area is classified as hydrologic soil groups B and C, with low infiltration capacity and a relatively poorly draining surface. Due to climate change and urbanization, the study area has suffered from floods more frequently than in the 1990s, and the increase in the magnitude and duration of storm events has pushed the existing stormwater system beyond its capacity. This urban drainage network was represented by a rainfall-runoff SWMM model. SWMM is a state-of-the-art tool developed to help support local, state, and national stormwater management objectives to reduce runoff and discharge and to improve stormwater quality [63,64]. It has been widely used all over the world in similar types of investigations, including stormwater runoff, combined and sanitary sewers, and other drainage systems [65-67]. Figure 3 shows the components of this SWMM model, which includes one rain gauge, 60 junctions, 61 conduits, two outfalls, and seven sub-catchments, while groundwater interflow, water evaporation, snowmelt, and manhole hydraulic losses are neglected during the simulation [68]. For this study, we created six artificial precipitation series according to the Chicago distribution method in PCSWMM v.7.3 and then imported them as modeling inputs. The distributions of the synthetic rains are shown in Figure 4. These rainfalls, with durations of 3 h, 12 h, and 48 h and return periods ranging from two years to five years, contain almost all the typical features and characteristics of real storms in the study area. Additionally, rainfall measurements for two real rainfall events were collected to test the clustering algorithms.
These rain records, from the 5 May 2015 and 8 July 2015 rainfall events, are representative of the typical real storms under average climatic conditions in the study area. Compared with the water depth generated by the artificially designed rainfall data, the time-series water depth produced by the real-world storms contains more non-stationarity and noise. Nevertheless, the obtained findings are subsequently validated with these real rain records.
Figure 4. Distribution plots of artificially designed rainfalls with different return periods and rainfall durations.

K-Means
A detailed investigation was carried out to assess the performance of the clustering algorithms. Figure 5 shows how the three performance metrics (SCI, CHI, and DBI) change with different cluster numbers when using K-means to cluster the time-series water depth data. The CHI values increase with higher cluster numbers, whereas the SCI and DBI values fluctuate. The SCI and DBI values show opposite trends, reflecting the different methods by which they are calculated (see Section 2.3 above). In particular, Figure 5b,c show that the best solution is with eight clusters, reflected in the largest SCI value and the smallest DBI value. These results suggest that the SCI and DBI are more suitable for assessing the performance of K-means, while any peak in the CHI related to cluster quality is eclipsed by the influence of the increasing number of clusters. Based on the SCI and DBI values in Figure 5a, the optimal number of clusters is six for the two year-3 h and five year-3 h rainfall scenarios. The differences in the optimal number of clusters in Figure 5a-c indicate that the rainfall duration affects the number of clusters when using K-means to group time-series water depth datasets.

K-Means
A detailed investigation was carried out to assess the performance of the clustering algorithms. Figure 4 shows how the three performance metrics, SCI, CHI, and DBI, change with different cluster numbers when K-means is used to cluster the time-series water depth data. The CHI values increase with higher cluster numbers, whereas the SCI and DBI values fluctuate. The SCI and DBI show opposite trends, reflecting the different ways in which they are calculated (see Section 2.3 above). In particular, Figure 5b,c show that the best solution is eight clusters, reflected in the largest SCI value and smallest DBI value. These results suggest that the SCI and DBI are more suitable for assessing the performance of K-means, while any peak in the CHI related to cluster quality is eclipsed by the influence of the increasing number of clusters. Based on the SCI and DBI values in Figure 5a, the optimal number of clusters is six for both the two-year 3 h and five-year 3 h rainfall scenarios. The differences in the optimal number of clusters across Figure 5a-c indicate that rainfall duration affects the number of clusters when K-means is used to group time-series water depth datasets.

Agglomerative Clustering
Figure 6 shows the same results but based on the use of agglomerative clustering (AC) to group the time-series water depth data. As with the K-means results (Figure 5), the CHI values increase with the number of clusters for all scenarios, from short-duration to long-duration rainfall. Again, it is difficult to identify an optimal number of clusters, which suggests that the CHI is not suitable for ascertaining the best clustering solution with these data. In contrast, the SCI and DBI show clear peaks in their values. Figure 6a shows that 16 clusters result in the maximum SCI, close to 0.76, and the minimum DBI of 0.38. Figure 5c shows a peak in SCI values (~0.6) for eight clusters, with a corresponding minimum in the DBI value (<0.4). However, Figure 6b shows that eight clusters produce the largest SCI (~0.62) and the lowest DBI (~0.40) for the two-year 12 h rainfall scenario (left subplot), while 16 clusters are the optimal solution for the five-year 12 h rainfall (SCI ~0.58 and DBI ~0.38; right subplot). In summary, the best cluster solutions for the AC algorithm are 16, eight, and eighteen clusters under the 3 h, 12 h, and 48 h duration rainfalls, respectively. Comparing the left subplots with the right subplots (Figure 6) provides evidence that the cluster number for the best AC performance remains broadly the same when the return period shifts from two-year to five-year. The rainfall return period (annual exceedance probability) was therefore found to be only weakly related to the number of clusters.

Spectral Clustering
Figure 7 shows the results obtained for different cluster numbers when spectral clustering is used to group the time-series water depth data. In contrast to the two previous methods, the SCI values decrease as the number of clusters increases. For the 12 h and 48 h scenarios, this index identifies solutions at about six to seven clusters, but no clear optimal solution is identified for the shorter scenarios (panel a), suggesting that this index is unsuitable for assessing this algorithm. The DBI values show greater variation as the number of clusters changes, although minima can be observed at six to seven clusters for most scenarios. The CHI values no longer show a linear increase but show clear peaks, although usually at higher numbers of clusters than the DBI identifies.
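The metric sweep behind Figures 4-7 can be sketched as follows with scikit-learn. This is a minimal illustration, not the paper's code: synthetic blob data stands in for the simulated water depth matrix (one row per junction), and the cluster range of 2-16 is an assumption chosen to mirror the figures.

```python
# Sweep SCI, CHI, and DBI over cluster counts for KC, AC, and SC, as in
# Figures 4-7. Synthetic blobs stand in for the water depth matrix.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Stand-in time-series water depth matrix: 120 junctions x 36 time steps.
X, _ = make_blobs(n_samples=120, centers=8, n_features=36, random_state=0)

algorithms = {
    "KC": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "AC": lambda k: AgglomerativeClustering(n_clusters=k),
    "SC": lambda k: SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                       n_neighbors=10, random_state=0),
}

scores = {}
for name, make in algorithms.items():
    for k in range(2, 17):
        labels = make(k).fit_predict(X)
        scores[(name, k)] = (silhouette_score(X, labels),        # SCI: higher is better
                             calinski_harabasz_score(X, labels),  # CHI: higher is better
                             davies_bouldin_score(X, labels))     # DBI: lower is better

# Best cluster count per algorithm according to the SCI, as read off Figures 5-6.
best_k = {name: max(range(2, 17), key=lambda k, n=name: scores[(n, k)][0])
          for name in algorithms}
print(best_k)
```

With well-separated synthetic blobs, the SCI peaks at the true cluster count for KC and AC, mirroring the clear peaks the paper reports for those two algorithms.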

Clustering Performance Testing
The analysis of cluster performance in the previous section is based on synthetic rainfall datasets, owing to the lack of observed water depth data in the drainage network. However, the use of noise-free synthetic data may have a significant impact on the results obtained [69], and our results may not represent real storm situations or current climate conditions. In the field, the trends identified here might be masked by time-series noise, making it more difficult to identify optimal solutions. To validate that the results obtained from designed rainfalls also apply to non-stationary real storms, we evaluate the performance of the clusters in grouping flooding water depth datasets generated by the two real flood events described below.
The left plots in Figure 8 indicate that the best numbers of clusters for the 5 May 2015 event (Figure 8a) and the 8 July 2015 event (Figure 8b) are five and four, respectively. Increasing the number of clusters beyond this causes both the SCI and the DBI to decline. The distribution of the clusters obtained is shown in the PCA plots in the right panels of Figure 8. These show that the cluster analysis resulted in a good separation of the storm events (indicated by the lack of overlap between the gray circles). It should be noted that both subplots 8a and 8b have an isolated cluster at the top. This is the only cluster composed of a single sample, which means the water depth at the corresponding junction is clearly distinguishable from the others. One possible explanation for this phenomenon is that a flooding or overflow event has occurred, triggering a very different water depth signal at this location. In addition, as the rainfall duration increases from 3 h (the 5 May 2015 storm) to 24 h (the 8 July 2015 storm), the reduction in the number of clusters selected is in line with the results of Section 4, supporting the negative correlation between the number of clusters and event duration.
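The PCA cluster maps and the single-member "isolated" cluster can be sketched as below. This is an illustrative reconstruction under stated assumptions: synthetic data stands in for the real storm records, agglomerative clustering (one of the paper's three algorithms) is used, and the anomalous-junction offset is invented for the demonstration.

```python
# Cluster a stand-in water depth matrix, project it to 2-D with PCA (as in
# the right panels of Figure 8), and flag single-member clusters: junctions
# whose depth signal is unlike all others, i.e. candidate flood locations.
import numpy as np
from collections import Counter
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# 60 ordinary junctions in four behavioural groups, plus one junction with
# anomalously high depths (hypothetical flooded junction, appended last).
X, _ = make_blobs(n_samples=60, centers=4, n_features=24, random_state=1)
X = np.vstack([X, X.max(axis=0) + 15.0])

labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)   # 2-D view for a scatter plot

# Single-member clusters correspond to the isolated dots in Figure 8.
counts = Counter(labels)
isolated = [i for i, lab in enumerate(labels) if counts[lab] == 1]
print(isolated)
```

Because merging the far-away junction into any group is costly under Ward linkage, the anomalous junction ends up in its own cluster, reproducing the isolated dot the paper interprets as a possible flood signal.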


Cluster Number Validation
The dendrogram plots are also used to validate the number of clusters. Figure 9 shows the dendrograms obtained by applying the AC algorithm to the flooding water depth data. Generally, the cut-off point should be set at a dissimilarity of at least 70% between two clusters, or where the difference in dendrogram heights is most significant [69]. Here, the number of clusters was selected using a distance threshold of 0.9 (90% dissimilarity), plotted as a horizontal cut-off line in all the dendrograms of Figure 9. The crossing points between the cut-off line and the dendrogram branches (highlighted as green X marks) identify the accepted clusters. In Figure 9, one point identified by the cut-off line (junction 8, highlighted as a red X) was considered an outlier and excluded. In practice, this algorithm might be helpful for anomaly detection in a sensor monitoring network. For instance, real-time monitoring aims to capture as many of the varying features of the measurements as possible with a limited number of sensors [70,71]. Further, the clusters represent different parts of the hydrological network and can be used to help target locations for sensor deployment to observe overflow and flood events in the field.
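The dendrogram cut-off can be sketched with SciPy's hierarchical clustering. This is a minimal sketch, not the paper's code: the synthetic depth matrix, the averaging linkage, and the normalisation of merge heights to [0, 1] (so that the 0.9 threshold reads as 90% dissimilarity) are all illustrative assumptions.

```python
# Build an AC linkage tree on a stand-in water depth matrix, cut it at 90%
# dissimilarity (the horizontal line in Figure 9), and report singleton
# clusters as outliers (cf. the red-X junction 8).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# 15 "normal" junctions in three behavioural groups (48 time steps each),
# plus one junction with anomalously high depths appended last.
depths = np.vstack([rng.normal(loc, 0.1, size=(5, 48))
                    for loc in (0.2, 0.6, 1.0)]
                   + [rng.normal(3.0, 0.1, size=(1, 48))])

Z = linkage(depths, method="average", metric="euclidean")
Z_norm = Z.copy()
Z_norm[:, 2] /= Z[:, 2].max()            # scale merge heights to [0, 1]

labels = fcluster(Z_norm, t=0.9, criterion="distance")  # cut at 0.9 dissimilarity
outliers = [i for i in range(len(depths))
            if (labels == labels[i]).sum() == 1]
print(sorted(set(labels)), outliers)
```

Only the topmost merge (joining the anomalous junction to everything else) exceeds the 90% line, so the cut separates that junction into a singleton cluster, which is exactly how the paper excludes junction 8.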

The vertical comparison among the subplots of Figure 9a-c shows that the appropriate cluster numbers for the 3 h, 12 h, and 48 h rainfall scenarios are quite similar: eight, nine, and nine, respectively. Meanwhile, when comparing the cluster solutions for different return periods (e.g., the left and right plots of Figure 9a), the number of clusters and their structure are remarkably similar, implying that the event return period has little impact on AC model performance. This supports the conclusion reached with the synthetic time series: the performance of the AC model depends noticeably on the flooding duration but not on the event return period (exceedance probability).
This study adopted intra-cluster distance as the metric to assess the effects of flooding duration and return period (exceedance probability) on the performance of the K-means and spectral clustering algorithms. Figure 10 shows the results of this comparison, with the intra-cluster distance decaying as the number of clusters increases. A notable elbow point (the intersection between the red dashed line and the intra-cluster distance curves) can be seen at four clusters, where the decrease in distance becomes much smaller. Using the elbow criterion described in Section 2.3.4, this suggests that four clusters are the best solution; increasing the number of clusters beyond this would yield little additional gain for the extra complexity of the solution. Figure 10 also shows that the intra-cluster distance changes in a similar way for all six rainfall scenarios, and that it is close for rainfalls with the same duration. For example, the solid purple line with purple circle markers (the two-year 3 h rainfall scenario) overlaps the red dashed line with red circle markers (the five-year 3 h scenario). However, there are still some differences between scenarios with different rainfall durations. Notably, the intra-cluster distance increases as the rainfall duration decreases (the distance for the 3 h rainfalls is the largest, followed by the 12 h cases and then the 48 h scenarios). As a metric for clustering performance, intra-cluster distance is therefore useful in determining how well these algorithms group the water depth time series. These results suggest that the K-means and spectral clustering algorithms work best with longer-duration rainfalls, implying that longer event durations produce greater similarity in the water depth at different junctions. This, coupled with the larger set of observations from a longer period, results in better-formed individual clusters. Wu et al. [72] have shown that these cluster methods work optimally when trained on massive datasets, which is supported by the results herein.
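The elbow analysis of Figure 10 can be sketched as follows. K-means inertia is used as the total intra-cluster distance, and the stopping rule (keep adding clusters while the relative gain exceeds 15%) is an illustrative stand-in for the elbow criterion of Section 2.3.4, which is not reproduced here.

```python
# Compute total intra-cluster distance (K-means inertia) for k = 1..10 and
# pick the elbow: the last k whose relative improvement is still large.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Stand-in water depth matrix: 200 junctions x 12 time steps, four groups.
X, _ = make_blobs(n_samples=200, centers=4, n_features=12, random_state=2)

ks = list(range(1, 11))
inertia = [KMeans(n_clusters=k, n_init=10, random_state=2).fit(X).inertia_
           for k in ks]

# Relative gain of each added cluster; the 15% cut-off is an assumption.
gains = [(inertia[i - 1] - inertia[i]) / inertia[i - 1]
         for i in range(1, len(inertia))]
elbow = max(k for k, g in zip(ks[1:], gains) if g > 0.15)
print(elbow)
```

As in Figure 10, the inertia curve drops steeply until the true group count is reached and then flattens, so the rule recovers the elbow at four clusters on this synthetic example.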

Clustering Parametric Discussion
Previous cluster-based studies have mainly focused on detecting pressure, demand, pipe bursts, infrastructure damage, and illicit intrusion in water distribution systems [71-73]. In the cluster analysis here, features such as the length of the time-series water depth data from UDSs are found to be negatively correlated with the number of clusters. This finding has been validated by the dendrogram cut-off points for the designed rainfalls, and also by the cluster center mapping based on real storm events. The similar results between the artificial (noise-free) and practical (noise-polluted) scenarios suggest that event duration (data length) outweighs event exceedance probability (data magnitude) in identifying the number of clusters, which agrees with the findings of [25,72]. Increasing the number of clusters generally reduces the within-cluster error; in the extreme case, the error drops to zero when each data point forms its own cluster. Intuitively, the choice of the best number of clusters can therefore be interpreted as a trade-off between maximally reducing the complexity of the data with a single cluster and maximizing accuracy by assigning each data point to its own cluster. For long time series, we suggest starting with a small number of clusters and increasing the number, testing the performance at each increase.
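The trade-off described above can be seen in a toy example: the within-cluster error falls monotonically as clusters are added and reaches zero when every point is its own cluster. The six one-dimensional "junction signatures" below are invented for the illustration.

```python
# Within-cluster error (K-means inertia) versus cluster count on six points:
# it decreases monotonically and hits zero at one cluster per point.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0], [5.1]])
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
print(inertias)
```

The final value is zero because each centroid coincides with its single point, which is why minimising error alone cannot choose the number of clusters: a complexity penalty (or an elbow rule) is needed.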
In addition to the determination of the number of clusters, the structure of the datasets may also affect clustering model performance. The KC and SC algorithms are able to robustly group water depth datasets from longer-duration flood events, while there is little relationship between algorithm performance and annual exceedance probability. The sharply rising trend in Figures 4-6 demonstrates that the CHI is not suitable for identifying the best number of clusters for the KC and AC algorithms, but the SCI and DBI work well and give comparable results (Figures 4-6). In contrast, the CHI works well in identifying the optimal cluster number for the SC algorithm. This difference reflects the different nature of the algorithms: KC and AC are based on simple dissimilarity measures between observations, whereas SC is based on a graph representing connectivity. The DBI evaluates intra-cluster similarity among the data points and inter-cluster differences among the groups, and the SCI similarly compares, for each data point, its cohesion with its own cluster against its separation from the nearest neighboring cluster. An SCI value close to 1 and a DBI value close to 0 indicate good clustering regardless of the algorithm being evaluated. The CHI, however, is not normalized, and it is difficult to compare CHI values obtained from different datasets.

Implications of Clustering Application
This study provides an understanding of different clustering algorithms, their applicability to different datasets, and an assessment of cluster solutions in flood detection strategies. For instance, as water level is one of the inferential indicators of local flood events, clusters with abnormal water levels can be identified as early-warning signals of flooding. As new data become available during monitoring, they can be assigned to the most similar cluster; decreasing dissimilarity to an abnormal cluster therefore indicates an increasing likelihood of flooding. In Figure 8, we observed one isolated dot in each subplot. These separated points represent highly dissimilar water depth data, indicating the possibility of a triggered flood event. The same cases are also captured in the dendrograms of Figure 9, which show that junction 8 (highlighted with a red cross) might be the source of the anomalous water level. One reasonable explanation for the anomalous cluster is a flooding or overflow event occurring around the corresponding location, and further investigation is recommended to determine whether this location is flooded. Classifying these points as anomalies is thus helpful for narrowing the spatial search domain from the network level to the node level, and consequently reduces the time and effort needed to identify flooded locations in a complex network system [74-76]. We conclude that the occurrence of anomalous changes in water level in UDSs could be a timely reminder of upstream or downstream overflow events for the neighborhoods. Our findings also explain how the characteristics of the dataset (notably length and magnitude) influence the number of clusters. This information could be employed to detect urban flood events using water depth datasets in other real drainage networks [66,67].
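The early-warning idea of assigning incoming data to the most similar cluster can be sketched as below. All names, depth values, and the "highest mean depth identifies the anomalous cluster" heuristic are illustrative assumptions, not the paper's implementation.

```python
# Fit clusters on historical water depth windows, then assign each incoming
# window to its nearest centroid; assignment to the known anomalous (flood)
# cluster raises a warning.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
normal = rng.normal(0.5, 0.05, size=(40, 24))   # typical depth windows (m)
flood = rng.normal(2.5, 0.05, size=(3, 24))     # windows from flooded junctions
model = KMeans(n_clusters=2, n_init=10, random_state=3).fit(np.vstack([normal, flood]))

# Heuristic: the cluster with the highest mean depth is the anomalous one.
anomalous = int(np.argmax(model.cluster_centers_.mean(axis=1)))

def flood_warning(window, model, anomalous_cluster):
    """Return True if the incoming window is assigned to the anomalous cluster."""
    label = int(model.predict(window.reshape(1, -1))[0])
    return label == anomalous_cluster

rising = rng.normal(2.4, 0.05, size=24)   # incoming window with high depths
print(flood_warning(rising, model, anomalous))
```

In a deployment, the distance of each new window to the anomalous centroid could also be tracked over time, so that steadily decreasing dissimilarity (rather than a hard assignment) serves as the graded early warning described above.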
These clustering algorithms aim to efficiently capture urban drainage flooding locations, providing a basis for managing existing drainage structures and developing sustainable urban drainage networks in urbanized areas [77].

Limitations and Future Work
Although this study has identified some clear differences in the application of cluster analysis, there are several limitations. Firstly, the majority of scenarios used time-series water depth datasets generated by model simulation. As these are smooth and noise-free, the results may not scale to field application. However, the limited set of observed rainfall series used here produced similar results, notably in the behavior of the different indices, although the observed data tended to result in a smaller number of clusters. Further work should apply these methods to a wider set of observed data to reduce the input (meteorological) uncertainties and meteorological variances if such data become available [36,37,78,79]. The possible integration of an ensemble prediction system (EPS) and data assimilation techniques might be of interest for future work, as this could help estimate forecast uncertainty via a linear combination of suitable meteorological variances and uncertainties linked to the rainfall and hydraulics [80,81]. Secondly, as this paper focuses only on exploring the usefulness of clustering model implementation and performance evaluation, error analysis and sensitivity analysis of the water level datasets are recommended to improve the reliability of the results. Future work will concentrate on the application of these methods, including water-level sensor placement, combined sewer overflow detection, and urban flooding prediction. Since the dendrogram enables the AC algorithm to detect outliers in time-series water depth datasets, it can be used to help guide sensor deployment at vulnerable sites for observing overflow and flood events in the field [76]. We plan to strengthen the connection between these theoretical results and field application by conducting a cluster analysis to optimize the sensor monitoring network for flooding detection in UDSs.

Summary and Conclusions
In the age of 'smart stormwater,' the increased deployment of sensors to monitor water level characteristics is producing rapidly accumulating data. It is becoming crucial to understand and promote methods for handling these big datasets to support flood detection and control. This study aims to promote understanding of how cluster analysis facilitates the interpretation of unlabeled time-series water depth data for flooding location detection in stormwater urban drainage systems. In this work, three indexes, the silhouette coefficient index, Calinski-Harabasz index, and Davies-Bouldin index, were used to evaluate the performance of three popular unsupervised cluster analysis models, namely K-means clustering, agglomerative clustering, and spectral clustering. A SWMM model of a real-world stormwater urban drainage system was applied to test the performance of the clustering algorithms in capturing urban floods. Five conclusions were drawn:
(1) The silhouette coefficient index and Davies-Bouldin index are suitable metrics for measuring the performance of the K-means and agglomerative clustering models when identifying the number of clusters that gives the best performance. However, the Calinski-Harabasz index is found to be more favorable for assessing the performance of the spectral clustering model in grouping time-series water depth datasets for urban drainage flooding detection.
(2) In the K-means and spectral clustering models, the number of clusters that maximizes model performance is highly related to the dataset length (flooding duration) but only slightly associated with the dataset magnitude. There is a negative correlation between the number of clusters and the length of the datasets.
(3) Short-period water depth data can be well grouped by the agglomerative clustering model. In contrast, the K-means and spectral clustering models are better able to handle time-series water depth datasets from long-duration storm scenarios.
(4) This work provides insight into unlabeled hydraulic data-driven techniques by conducting clustering experiments. The outcomes are useful for researchers selecting an appropriate clustering model and the corresponding performance metrics for specific urban flooding applications.
(5) The detailed analyses in this work provide guidance on using cluster solutions to isolate or prescreen vulnerable locations in flooded-location detection strategies. The water level in isolated clusters can be treated as a flood early warning for local residents: the occurrence of anomalous changes in water level in urban drainage systems could be a timely reminder of upstream or downstream flood events for the surrounding neighborhoods.