Clustering of Time Series Water Quality Data Using Dynamic Time Warping: A Case Study from the Bukhan River Water Quality Monitoring Network

Abstract: It is essential to monitor water quality for river water management because river water is used for various purposes and is directly related to the health and safety of a population. Proper network installation and removal is an important part of water quality monitoring and network operation efficiency. To do this, cluster analysis based on calculated similarity between measuring stations can be used. In this study, we measured the similarities between 12 water quality monitoring stations of the Bukhan River. River water quality data always have a station-dependent time lag because water flows from upstream to downstream; therefore, we propose using the Dynamic Time Warping (DTW) algorithm, which searches for the minimum distance by changing and comparing time points, rather than the Euclidean algorithm, which compares the same time point. Both the Euclidean and DTW algorithms were applied to nine water quality variables to identify similarities between stations, and K-medoids cluster analysis was performed based on the similarity. The Clustering Validation Index (CVI) was used to select the optimal number of clusters. Our results show that the Euclidean algorithm formed clusters by mixing mainstream and tributary stations; the mainstream stations were largely divided into three different clusters. In contrast, the DTW algorithm formed clear clusters by reflecting the characteristics of water quality and watershed. Furthermore, because the Euclidean algorithm requires the lengths of the time series to be the same, data loss was inevitable. As a result, even where clusters were the same as those obtained by DTW, the characteristics of the water quality variables in the cluster differed. The DTW analysis in this study provides useful information for understanding the similarity or difference in water parameter values between different locations. Thus, the number and location of required monitoring stations can be adjusted to improve the efficiency of field monitoring network management.


Introduction
River water is used for various purposes (e.g., human consumption, agricultural irrigation) and is directly related to the health and safety of a population. As such, it is essential to monitor water quality for river water management. To this end, the Ministry of Environment of the Republic of Korea has installed water quality monitoring networks along rivers nationwide. However, as the number of measurement stations has increased, the time and cost of data analysis have also increased. Therefore, it is increasingly important to operate optimal water quality monitoring networks, including the efficient selection and removal of water quality measurement stations. Costs can be reduced by using cluster analysis to group stations with similar water quality characteristics and then measuring water quality at a representative station in each cluster.
Cluster analysis is a multivariate data analysis method that groups objects into several clusters by measuring the similarity between objects through distance and identifies the characteristics of each cluster [1]. Many studies have applied cluster analysis to water quality data. Aubert et al. [2] applied cluster analysis to multivariate time series water quality data collected from the Kervidy-Naizin watershed to identify flood patterns. Lyra et al. [3] applied cluster analysis to rainfall data in order to identify rainfall patterns by year and month. However, cluster analysis divides objects into clusters based on similarity or dissimilarity, and so the results vary depending on the type of distance measure.
The most commonly used approach is based on Euclidean distance. Emad et al. [4] evaluated the similarity of 11 sample points based on Euclidean distances and performed cluster analysis using 16 water quality variables measured monthly from 2008 to 2009 to evaluate the water quality of the Euphrates River. Azhar et al. [5] used six water quality variables collected from 1998 to 2007 to measure the similarity of nine points in the Muda River basin through Euclidean distance clustering. The Dynamic Time Warping (DTW) algorithm [6] offers an alternative that has been applied in numerous fields. For example, Sakoe et al. [7] applied the DTW algorithm to speech recognition as a method for measuring the similarity of speech data. Dürrenmatt et al. [8] applied DTW to water quality data (e.g., water temperature) from upstream and downstream sensors to measure the travel time and calculate velocity; from the results, they proposed ways to improve the monitoring of sewage flow rate.
Woo et al. [9] used DTW to compare model-predicted and observed water conductivity signals, which differed in time shift and amplitude, at 5 min intervals at four monitoring points, using tracer study data provided by the Hillsborough County Water Resources Services (HCWRS) in Florida. They showed that DTW improves the alignment of the observed and model-predicted tracer signals over conventional methods. Dupas et al. [10] used DTW to identify seasonal variations in phosphate concentration during storm events. Because the lengths of high-frequency storm concentration time series may differ, it is difficult to calculate a distance between pairs of comparable points for clustering. They showed that a DTW-based K-means clustering algorithm was useful for identifying common patterns in water quality time series and for isolating unusual events.
Time series water quality data reported in the Water Information System of South Korea (WIS, http://water.nier.go.kr) have a time lag as water flows from upstream to downstream. Furthermore, the lengths of the data sets differ owing to sensor failure at some stations. To determine the similarity of time series data, clustering of stations is commonly applied. The Euclidean algorithm is widely used because it is simple: it sums the linear distances between two time series at the same time points. However, because it aligns each point of one sequence with the same time point of the other, it is difficult to apply to time series that have time lags or different lengths, and it may report low similarity for data with a time delay. In contrast, the Dynamic Time Warping (DTW) algorithm compares two time series by changing the comparison time point, and so it can compare data with time lags or different lengths without loss. Furthermore, it is relatively robust against distortion and deformation in time. Therefore, DTW can measure similarity in a way that reflects the time delay caused by the distance between measurement points, and cluster analysis based on it is expected to be more reasonable than that based on the Euclidean algorithm. Ouyang et al. [11] used DTW to calculate the similarity of hydrological time series because such series (e.g., two flood sequences) have approximately the same overall shape, but the shapes are not aligned along the time axis.
Water 2020, 12, 2411
This study aimed to measure similarities between water quality data measured at different water quality monitoring stations by using the DTW algorithm to perform cluster analysis; the results were compared with those clustered using the Euclidean algorithm. Both approaches were applied to data collected weekly along the Bukhan River by the Ministry of Environment of the Republic of Korea. The cluster results according to the similarity algorithm were then compared based on the characteristics and patterns of variables for each cluster.

Study Site
The Bukhan River originates in Gangwon-do and, along with the Namhan River, is one of two major tributaries of the Han River. In this study, we used weekly water quality data of the Bukhan River from 2016 to 2018 obtained from the WIS. There are a total of 12 monitoring stations (Figure 1; Table 1), including 6 mainstream stations (Hwacheon, Chuncheon A, UiamDam, Cheongpyeong, Sambongli, and PaldangDam) and 6 tributary stations (Gapyeongcheon3 (stream), Jojongcheon3 (stream), Mukhyeoncheon (stream), Soyanggang2 (river), Hongcheongang6 (river), and Byeoggyecheon (stream)).

Data
In this study, nine water quality variables contained in the WIS were used: hydrogen ion concentration (pH), dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), suspended solids (SS), total nitrogen (TN), total phosphorus (TP), water temperature (Temp), and electrical conductivity (EC) (Table 2). Table 3 shows the basic statistics and lengths of the data from the 12 stations. The Euclidean algorithm requires data of the same length for comparison, whereas the DTW algorithm can be used even if data lengths differ. Therefore, for the Euclidean algorithm, only data collected at the same time at all stations were analyzed: if data were missing at even one station, the data at that time point were removed for all stations. As a result, a total of 1680 data points were analyzed using the Euclidean algorithm, and 1814 data points were analyzed using the DTW algorithm.

Missing Data
This study used the Kalman replacement method, which replaces missing values based on the Kalman filter. The Kalman filter removes data noise and predicts future data by using historical and newly measured data in a dynamic linear model that changes over time.
The Kalman filter repeats the prediction and update phases to estimate variables. In the prediction step, the state variable at time t and the preliminary (prior) estimate of the error covariance are calculated using the state variable at time t−1. Then, in the update step, the prior estimate is updated by applying the Kalman gain and the observation at time t. The updated estimate is called the posterior estimate. Figure 2 visualizes the missing values in the data used in this study; missing values accounted for 16.5% of the whole dataset.
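The prediction and update cycle described above can be sketched in Python (chosen here only for illustration; the study itself used R). This is a minimal sketch, assuming a univariate local-level (random-walk) state model with fixed noise variances. The function name `kalman_impute` and the parameters `process_var` and `obs_var` are hypothetical, and the model actually fitted in the study is not specified in the text.

```python
def kalman_impute(series, process_var=1.0, obs_var=1.0):
    """Fill None entries using a local-level Kalman filter (illustrative sketch).

    Assumed model: state_t = state_{t-1} + noise(process_var),
                   obs_t   = state_t     + noise(obs_var).
    """
    first = next((x for x in series if x is not None), None)
    if first is None:
        raise ValueError("series contains no observed values")

    state, cov = first, obs_var  # initial state estimate and error covariance
    filled = []
    for y in series:
        # Prediction step: under a random walk the prior mean equals the
        # previous posterior mean, and the uncertainty grows.
        prior_state = state
        prior_cov = cov + process_var

        if y is None:
            # No observation at this time: keep the prior and use it as the fill.
            state, cov = prior_state, prior_cov
            filled.append(prior_state)
        else:
            # Update step: blend prior and observation using the Kalman gain.
            gain = prior_cov / (prior_cov + obs_var)
            state = prior_state + gain * (y - prior_state)
            cov = (1.0 - gain) * prior_cov
            filled.append(y)  # observed values are kept unchanged
    return filled


print(kalman_impute([1.0, 2.0, None, 4.0]))
```

Because this sketch only filters forward, each filled value depends on earlier observations alone; a smoother that also uses later observations would generally give different fills.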

Dynamic Time Warping
This study conducted K-medoids cluster analysis of the water quality network data with the Euclidean and DTW algorithms, using the 'dtw' package in the statistical program R (ver. 3.6.1).
The Euclidean algorithm compares two time series point by point at the same time index; therefore, the two series must have the same length. If the lengths differ, one of the series must be transformed, which inevitably leads to loss of information. In addition, if the values at two stations are offset by a delay (e.g., owing to the distance between the stations), the measured similarity is low even if the two series have the same shape.
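Both limitations can be seen in a small Python sketch. The series `upstream` and `downstream` are made-up values, not study data: the second is the first shifted by one time step, yet the point-by-point distance is far from zero, and unequal lengths cannot be compared at all.

```python
import math


def euclidean_distance(q, r):
    """Point-by-point Euclidean distance between two equal-length series."""
    if len(q) != len(r):
        raise ValueError("Euclidean comparison needs equal-length series")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))


# Hypothetical example: the same pulse arrives one time step later downstream.
upstream = [0, 0, 1, 3, 1, 0, 0, 0]
downstream = [0, 0, 0, 1, 3, 1, 0, 0]

print(euclidean_distance(upstream, upstream))    # identical series
print(euclidean_distance(upstream, downstream))  # same shape, shifted in time
```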
In contrast, the DTW algorithm matches time series in a direction that minimizes the distance between two time series, allowing data at different points in time to be compared. Thus, time series of different lengths can be compared through the DTW algorithm without loss of data. Furthermore, the DTW algorithm has the advantage of being relatively robust to distortion and deformation. Figure 3 shows a conceptual plot of the Euclidean and DTW algorithms for two time series Q and R with different lengths m and n. 'E' and 'W' are the lines that map the pairs of points compared by each method. Because the Euclidean distance compares only data at the same time point, each 'E' is a vertical line. Because DTW can also compare data at different time points, 'W' need not be a vertical line.
Assuming two time series Q = (q_1, …, q_m) and R = (r_1, …, r_n) with different lengths m and n (Figure 4a), the DTW algorithm proceeds as follows:

Step 1 (Figure 4b): Create the local cost matrix C, also called the local distance matrix, by using the local cost function c, which represents the Euclidean distance between two points (q, r). For two time series Q and R, the local cost function c and cost matrix C are defined as follows:

$$c(q_i, r_j) = |q_i - r_j|, \qquad C(i, j) = c(q_i, r_j), \quad i = 1, \dots, m, \; j = 1, \dots, n,$$

where c is the distance between any two points of time series Q and R, and i and j are the indices of the i-th and j-th points of each time series. The value of c is smaller when the points being compared are similar and larger when they differ. If the DTW algorithm is applied to multivariate time series data with V variables, the local cost is obtained by calculating the local distance for each variable at the same time point and summing them as follows:

$$c(q_i, r_j) = \sum_{v=1}^{V} |q_{i,v} - r_{j,v}|,$$

where q_{i,v} and r_{j,v} are the values of variable v at the i-th point of Q and the j-th point of R.

Step 2 (Figure 4c): Create the global cost matrix M. First, the first row and first column of M are calculated as follows:

$$M(1, j) = \sum_{k=1}^{j} C(1, k), \quad j = 1, \dots, n; \qquad M(i, 1) = \sum_{k=1}^{i} C(k, 1), \quad i = 1, \dots, m.$$

At this time, M(1,1) = C(1,1). Then, calculate the rest of the matrix as follows:

$$M(i, j) = C(i, j) + \min\{M(i-1, j-1),\; M(i-1, j),\; M(i, j-1)\}, \quad i = 2, \dots, m, \; j = 2, \dots, n.$$

Step 3 (Figure 4d): Find the optimal warping path in the global cost matrix M that meets the constraints described by Sakoe et al. [7]. Many warping paths satisfy these constraints; the DTW algorithm takes the path that minimizes the sum of the distances between the two time series as the optimal warping path. The mapping of two time series through a warping path is shown in Figure 4e.
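The three steps can be sketched in Python as follows. This is an illustrative sketch, not the R 'dtw' package used in the study: it assumes an absolute-difference local cost summed over variables and the plain three-neighbor recurrence of Step 2, whereas library implementations offer several step patterns and may weight steps differently by default. The names `local_cost` and `dtw` are hypothetical.

```python
def local_cost(q_i, r_j):
    """Local distance between two points; points are scalars or equal-length tuples."""
    if isinstance(q_i, (int, float)):
        return abs(q_i - r_j)
    return sum(abs(a - b) for a, b in zip(q_i, r_j))  # sum over variables


def dtw(Q, R):
    """Return (DTW distance, warping path) for series Q (length m) and R (length n)."""
    m, n = len(Q), len(R)

    # Step 1: local cost matrix C.
    C = [[local_cost(Q[i], R[j]) for j in range(n)] for i in range(m)]

    # Step 2: global (accumulated) cost matrix M.
    M = [[0.0] * n for _ in range(m)]
    M[0][0] = C[0][0]
    for i in range(1, m):
        M[i][0] = M[i - 1][0] + C[i][0]  # first column
    for j in range(1, n):
        M[0][j] = M[0][j - 1] + C[0][j]  # first row
    for i in range(1, m):
        for j in range(1, n):
            M[i][j] = C[i][j] + min(M[i - 1][j - 1], M[i - 1][j], M[i][j - 1])

    # Step 3: backtrack from the last cell to the first along minimal predecessors.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            _, i, j = min(
                (M[i - 1][j - 1], i - 1, j - 1),
                (M[i - 1][j], i - 1, j),
                (M[i][j - 1], i, j - 1),
            )
        path.append((i, j))
    path.reverse()
    return M[m - 1][n - 1], path


# Hypothetical lagged pulse: DTW can realign it, and lengths may differ.
print(dtw([0, 0, 1, 3, 1, 0, 0, 0], [0, 0, 0, 1, 3, 1, 0, 0])[0])
print(dtw([1, 2, 3], [1, 2, 2, 3]))
```

The path is returned as zero-based index pairs (i, j), each meaning that point i of Q is compared with point j of R; it always starts at the first pair of points and ends at the last.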

Clustering Method
Clustering is a multivariate data analysis method that groups objects into several clusters based on similarities between objects through distance measurements and identifies the characteristics of each cluster. Objects with high similarity are placed in the same cluster, and those with low similarity are placed in different clusters. In other words, clusters are formed so that the variance of data within a cluster is minimal and the variance between clusters is maximal. Cluster analysis includes various methods, such as K-means clustering, hierarchical clustering, K-medoids clustering, and fuzzy clustering, and uses only the given data without prior information about the data.
The K-medoids algorithm used in this study was proposed by Kaufman et al. [12] and forms clusters around medoids, which are representative objects located at the center of each cluster. The K-medoids algorithm is also called the Partitioning Around Medoids (PAM) algorithm, and the process of forming clusters is as follows. Initially, k medoids are selected at random, and the remaining objects are assigned to the cluster of the nearest medoid. Then, for each cluster, the object with the smallest mean distance to all other objects in the cluster is set as the new medoid. The remaining objects are again assigned to the cluster of the nearest medoid. This process is repeated until the new medoids and the existing medoids are identical, and the clusters at the end of the iterative process are taken as the final K-medoids clusters. In this way, the K-medoids algorithm repeatedly replaces a medoid with a non-medoid object so as to minimize the distances between the objects forming the same cluster. This method represents an improvement on the K-means algorithm, which is greatly affected by outliers; K-medoids is less sensitive to outliers because it uses an actual object, rather than the average, as the cluster center. PAM cluster analysis was performed using the 'dtwclust' package in R. Because the units of the water quality variables all differ, the distance calculation and cluster analysis were performed after normalizing each variable.
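The iteration described above (assign objects to the nearest medoid, then move each medoid to the member with the smallest total distance within its cluster) can be sketched in Python on a precomputed distance matrix. This is a simplified illustration and not the PAM implementation in 'dtwclust'; the function name `k_medoids`, the `seed` and `max_iter` parameters, and the toy `points` are assumptions made for the example.

```python
import random


def k_medoids(D, k, seed=0, max_iter=100):
    """Cluster n objects given an n x n distance matrix D; returns (medoids, labels)."""
    n = len(D)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # random initial medoids (object indices)

    def assign(meds):
        # Label each object with the index of its nearest medoid.
        return [min(range(k), key=lambda c: D[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total (hence mean) distance
            # to the members of its cluster.
            best = min(members, key=lambda i: sum(D[i][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:  # medoids unchanged: converged
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy data: two well-separated groups on a line (not study data).
points = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(D, k=2)
print(medoids, labels)
```

Because the function only needs a distance matrix, the same code works whether the entries of `D` are Euclidean or DTW distances between stations.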

Clustering Validation Index
The Clustering Validation Index (CVI) was used to determine the optimal number of clusters [13,14]. A CVI is an indicator of how well clusters are formed. In this study, six internal CVIs were used; they were calculated based solely on the data and cluster results. The CVIs were the Silhouette (Sil) proposed by Rousseeuw [15], the Calinski-Harabasz (CH) proposed by Caliński et al. [16], the Dunn (D) proposed by Dunn [17], the Davies Bouldin (DB) proposed by Davies et al. [18], the COP index proposed by Gurrutxaga et al. [19], and the Modified Davies Bouldin (MDB) proposed by Kim et al. [20]. The larger the values of Sil, CH, and D, and the smaller the values of DB, MDB, and COP, the better the clusters are formed. The CVIs were calculated using the 'dtwclust' package in R.
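As an illustration of an internal CVI, the Silhouette index can be computed from a distance matrix and cluster labels alone. The Python sketch below follows the standard definition (mean within-cluster distance a, smallest mean distance to another cluster b, score (b − a) / max(a, b)); the function name `silhouette`, the score of 0 for singleton clusters, and the toy data are assumptions for the example, not output of the study.

```python
def silhouette(D, labels):
    """Mean silhouette width for a distance matrix D and cluster labels (>= 2 clusters)."""
    n = len(D)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for a singleton cluster
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(D[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(D[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


# Toy data (not study data): a sensible partition scores higher than a mixed one.
points = [0, 1, 2, 10, 11, 12]
D = [[abs(p - q) for q in points] for p in points]
good = silhouette(D, [0, 0, 0, 1, 1, 1])
bad = silhouette(D, [0, 1, 0, 1, 0, 1])
print(good, bad)
```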

Optimization CVI
The number of clusters was varied from two to five, and the optimal number of clusters was determined based on the CVIs (Table 4). Typically, many CVIs are calculated and compared with each other, and a majority vote can be used to determine the final outcome [13,14]. On this basis, the optimal number of clusters was found to be five for both algorithms. Missing data values were replaced using Kalman filters, and standardized data were used for analysis.
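A majority vote over several CVIs can be sketched in Python as follows. Each index first picks its own best number of clusters (the largest value for Sil, CH, and D; the smallest for DB, MDB, and COP, following their standard definitions), and the most frequent pick wins. The numbers in `example` are invented for illustration and are not the values in Table 4; the function name `vote_k` is hypothetical.

```python
from collections import Counter

MAXIMIZE = {"Sil", "CH", "D"}     # larger is better
MINIMIZE = {"DB", "MDB", "COP"}   # smaller is better


def vote_k(cvi_table):
    """cvi_table maps index name -> {k: value}; returns (winning k, per-index picks)."""
    picks = {}
    for name, by_k in cvi_table.items():
        if name in MAXIMIZE:
            picks[name] = max(by_k, key=by_k.get)
        elif name in MINIMIZE:
            picks[name] = min(by_k, key=by_k.get)
        else:
            raise ValueError(f"unknown index: {name}")
    winner, _ = Counter(picks.values()).most_common(1)[0]
    return winner, picks


# Invented values for k = 2..5 (NOT the study's results).
example = {
    "Sil": {2: 0.30, 3: 0.35, 4: 0.33, 5: 0.41},
    "CH":  {2: 10.0, 3: 12.0, 4: 11.0, 5: 15.0},
    "D":   {2: 0.20, 3: 0.25, 4: 0.22, 5: 0.21},
    "DB":  {2: 1.10, 3: 0.90, 4: 0.95, 5: 0.80},
    "MDB": {2: 1.20, 3: 1.00, 4: 1.05, 5: 0.85},
    "COP": {2: 0.50, 3: 0.45, 4: 0.40, 5: 0.42},
}
winner, picks = vote_k(example)
print(winner, picks)
```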

Comparison of the Euclidean and Dynamic Time Warping Algorithms
Water quality measuring stations were divided into five clusters as shown in Figure 5. Using the Euclidean algorithm, Mukhyeoncheon formed a cluster alone; Sambongli, UiamDam, Chuncheon A, and PaldangDam formed a cluster; Gapyeongcheon3, Jojongcheon3, and Cheongpyeong formed a cluster; and finally, Byeoggyecheon and Hongcheongang6 formed a cluster. However, when the DTW algorithm was used, Gapyeongcheon3 and Jojongcheon3 were separated from Cheongpyeong, while UiamDam was separated from the other measuring stations in the mainstream. Cheongpyeong and Hwacheon formed a cluster with Soyanggang2, Chuncheon A, and PaldangDam. As with the Euclidean algorithm, Mukhyeoncheon formed a cluster alone. Unlike with the Euclidean algorithm, UiamDam also formed a cluster alone.
For the clusters formed with both algorithms, the means of all variables differed significantly among clusters. To examine these mean differences in detail, Tukey's test was used at a significance level of 0.05 (Table 5). When statistically significant differences were found, the groups were labeled with different letters; a, b, c, and d were assigned in order from the highest group to the lowest group.
When using the DTW algorithm, Cluster 1 (Soyanggang2 and the mainstream except UiamDam) showed low TN and Temp. Cluster 2 (Gapyeongcheon3 and Jojongcheon3) showed high pH, DO, and Temp, and low COD. Cluster 3 (UiamDam) showed high pH and low TN, Temp, and EC. Cluster 4 (Byeoggyecheon and Hongcheongang6) showed high pH and DO, and low BOD, COD, and Temp averages. Cluster 5 (Mukhyeoncheon) showed the same characteristics as the cluster formed using the Euclidean algorithm.
Table 6 shows cluster-specific variable characteristics for each algorithm. Both the Euclidean and DTW algorithms included Byeoggyecheon and Hongcheongang6 in the same cluster, but the water quality characteristics of the clusters differed. This difference reflects the need for data removal when using the Euclidean algorithm, and shows that data distortion is inevitable when the time series being compared have different lengths. Figures 6 and 7 show cluster-specific boxplots for each variable when the 12 stations are divided into 5 clusters using K-medoids cluster analysis with the Euclidean and DTW algorithms, respectively. When using the Euclidean algorithm (Figure 6), Mukhyeoncheon (Cluster 5) had greater values and deviations than the stations in other clusters in terms of BOD, COD, TN, TP, and EC. The result was similar using the DTW algorithm (Figure 7). Figures 8 and 9 show cluster-specific time series plots for each variable when the 12 stations were divided into 5 clusters through K-medoids cluster analysis using the Euclidean and DTW algorithms, respectively. In both cases, Cluster 5 (Mukhyeoncheon) showed unusual water parameter patterns compared with the other stations. Regardless of cluster, water temperature was similar for most stations, although Soyanggang2 had unusually low water temperatures compared with the others.

Comparison of Clustering and Water Quality Patterns
The Euclidean algorithm, which aligns the i-th point of one sequence with the i-th point of the other sequence, can report low similarity for sequences that are shifted in time. Because of this, DTW, which allows nonlinear alignment, is often used instead of the Euclidean algorithm in various fields. In hydrology, Ouyang et al. [11] used the DTW algorithm instead of the Euclidean algorithm for similarity search and pattern discovery in hydrologic time series data. Chotirat et al. [20] classified time series data obtained from video data with a DTW-based model and compared its performance with that of a Euclidean-based model.
Accordingly, this study applied the Euclidean and DTW distance algorithms to water quality data to determine similarities in water quality among monitoring stations and to identify the characteristics of water quality variables by cluster.
The Euclidean method clustered stations from the mainstream, left tributary, and right tributary together. In contrast, DTW formed three clusters that generally reflected the mainstream, left tributary, and right tributary (except for Soyanggang2, UiamDam, and Mukhyeoncheon). As such, the DTW approach better reflected the regional characteristics of the watersheds and hydraulic environments. Both algorithms showed statistically significant mean differences across clusters in all variables, and both clustered Byeoggyecheon and Hongcheongang6 together. However, the water quality characteristics of the clusters differed, highlighting the impact of unavoidable data removal when using the Euclidean algorithm, resulting in a distortion of water quality characteristics.
Cluster 1 of the DTW classification, representing mainstream stations, shows relatively better water quality than Cluster 2 (the left tributary) and Cluster 4 (the right tributary). However, while the Soyanggang2 monitoring station belongs to the left-hand tributary, it was classified into the mainstream cluster. The reason for this is that the measuring station is located directly downstream of the Soyanggang Dam, a large-scale dam with a storage capacity of 29 million m³. The dam has relatively good water quality because it is located in a water resource protection zone and is used as a source of drinking water. As such, the water quality at the Soyanggang2 monitoring station is better than that at the monitoring stations in Cluster 2 (i.e., the other stations on the left-hand tributary), and so it was classified into the mainstream cluster, whose stations show better water quality.
Despite being a right-hand tributary, Mukhyeoncheon was classified as a separate cluster. The concentrations of BOD and COD (representative of pollution by organic matter), and of TN and TP (representative of pollution by nutrients) averaged 2.38 mg/L, 6.17 mg/L, 3.34 mg/L, and 0.1 mg/L, respectively. This demonstrates markedly higher levels of contamination than other stations, particularly those in Cluster 4 (the other right-hand tributaries).
UiamDam belongs to the mainstream, but its BOD, COD, TN, TP, and SS values are all greater than those of other stations in DTW Cluster 1 (i.e., other mainstream stations). In particular, the average BOD was 1.31 mg/L, which is significantly higher than the average of the stations corresponding to DTW Cluster 1.
The Bukhan River Basin has a low population density and few industrial facilities. However, Mukhyeoncheon and UiamDam lie along the densely populated cities in the basin and are affected by various urban streams. We believe that various point and non-point pollutants in these urban areas affect the water quality at these stations, which explains why they are not clustered with other stations from the same geographical area.
Long-term observational data obtained from on-site monitoring networks are critical for the proper management of water quality and ecosystems. However, the operation of on-site monitoring stations is not always possible due to limited budget. The DTW analysis in this study provides useful information for understanding the similarity or difference in water parameter values between different locations. Thus, the number and location of required monitoring stations can be adjusted to improve the efficiency of field monitoring network management.

Limits and Future Work
The main limitation of this study was the low number of samples, which reflects the relatively small scale of the Bukhan River water system. Because DTW is time-consuming, with the calculation time increasing as the data volume increases, using a limited number of samples was necessary. However, in the future, the analysis will be expanded to include more rivers. In addition, the water quality data used in this study included missing values owing to the failure of the measuring sensors and/or human error (e.g., negligence). If missing values significantly change the time series trend, the reliability of the analysis results may be lowered.
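The growth in calculation time can be made concrete with a rough operation count. This is a back-of-the-envelope sketch assuming that every pair of N stations is compared with an unconstrained DTW over series of lengths m and n with V variables; the symbol N is introduced here for illustration.

```latex
\text{number of station pairs} = \binom{N}{2} = \frac{N(N-1)}{2},
\qquad
N = 12 \;\Rightarrow\; \frac{12 \cdot 11}{2} = 66,
\qquad
\text{cost per pair} = O(m\,n\,V).
```

Under these assumptions, the total work grows quadratically with both the number of stations and the length of the series, which is why expanding the analysis to more rivers or longer records increases the calculation time substantially.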
In future studies, it may be possible to add variables such as chlorophyll-a and fecal coliforms, which were not used in the analysis owing to high rates of missing data, or to select proxy variables that reflect the characteristics of water quality variables. However, if the sources of stream pollution continue to increase, there is a limit to the efficacy of monitoring and improving water quality by relying on general concentration regulation methods. To this end, the Ministry of Environment of South Korea is introducing and implementing a "Total Water Pollution Load Management System" that regulates the amount of pollution and reflects the emission of pollutants. The pollutant load data (including flow rate) were not utilized in this study because measurement frequency and measurement points remain limited. However, if sufficient data are secured in the future, it would be possible to add pollutant load data to reflect the amount of pollutant discharge. We also plan to compare the results of dynamic PCA with those of DTW.

Conclusions
Proper network installation and removal is an important part of water quality monitoring and network operation efficiency. To reduce the time and cost required to secure and monitor water quality data at locations where measurement is difficult, cluster analysis based on calculated similarity and dissimilarity between measuring stations can be used. Cluster analysis forms clusters based on the similarity measured according to distance, and so cluster results may vary depending on the type of distance. This study clustered water quality measuring stations of the Bukhan River water system using the K-medoids cluster analysis based on both the Euclidean and DTW algorithms.
The Euclidean algorithm compares the same time points of two time series and is limited by the fact that the lengths of the two time series must be the same. In contrast, the DTW algorithm compares time series while changing the time point and can be used even if the lengths of the two time series are different. In water quality measurement network data, there is a time lag as water flows from upstream to downstream, and the length of the data may differ among measuring stations owing to failures of the measuring device and other causes. Therefore, when clustering water quality data from a measurement network, the DTW algorithm is preferable.
Our results show that the Euclidean algorithm formed clusters by mixing mainstream and tributary stations; the mainstream stations were largely divided into three different clusters. In contrast, the DTW algorithm formed clear clusters by reflecting the characteristics of water quality and watershed. Furthermore, because the Euclidean algorithm requires the lengths of the time series to be the same, data loss was inevitable. As a result, even where clusters were the same as those obtained by DTW, the characteristics of the water quality variables in the cluster differed.