Article

Clustering of Time Series Water Quality Data Using Dynamic Time Warping: A Case Study from the Bukhan River Water Quality Monitoring Network

1
Future Strategy Department, Chungbuk Innovation Institute of Science & Technology, Chungbuk 28126, Korea
2
Department of Information & Statistics, Chungbuk National University, Chungbuk 28644, Korea
3
Environmental Measurement and Analysis Center, National Institute of Environmental Research, Incheon 22689, Korea
4
Engineering Division, DongMoon ENT Co., Ltd., Seoul 08377, Korea
5
Department of Civil and Environmental Engineering, Hanbat National University, Daejeon 34158, Korea
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this manuscript.
Water 2020, 12(9), 2411; https://doi.org/10.3390/w12092411
Submission received: 14 July 2020 / Revised: 24 August 2020 / Accepted: 25 August 2020 / Published: 27 August 2020
(This article belongs to the Section Water Resources Management, Policy and Governance)

Abstract

It is essential to monitor water quality for river water management because river water is used for various purposes and is directly related to the health and safety of a population. Proper installation and removal of monitoring stations is an important part of water quality monitoring and network operation efficiency. To support this, cluster analysis based on calculated similarity between measuring stations can be used. In this study, we measured the similarities between 12 water quality monitoring stations of the Bukhan River. River water quality data always have a station-dependent time lag because water flows from upstream to downstream; therefore, we propose using the Dynamic Time Warping (DTW) algorithm, which searches for the minimum distance by changing and comparing time points, rather than the Euclidean algorithm, which compares the same time point. Both the Euclidean and DTW algorithms were applied to nine water quality variables to identify similarities between stations, and K-medoids cluster analysis was performed based on the similarity. The Clustering Validation Index (CVI) was used to select the optimal number of clusters. Our results show that the Euclidean algorithm formed clusters by mixing mainstream and tributary stations; the mainstream stations were largely divided into three different clusters. In contrast, the DTW algorithm formed clear clusters by reflecting the characteristics of water quality and watershed. Furthermore, because the Euclidean algorithm requires the lengths of the time series to be the same, data loss was inevitable. As a result, even where clusters were the same as those obtained by DTW, the characteristics of the water quality variables in the cluster differed. The DTW analysis in this study provides useful information for understanding the similarity or difference in water parameter values between different locations. Thus, the number and location of required monitoring stations can be adjusted to improve the efficiency of field monitoring network management.

1. Introduction

River water is used for various purposes (e.g., human consumption, agricultural irrigation) and is directly related to the health and safety of a population. As such, it is essential to monitor water quality for river water management. To this end, the Ministry of Environment of the Republic of Korea has installed water quality monitoring networks along rivers nationwide. However, as the number of measurement stations has increased, the time and cost of data analysis have also increased. Therefore, it is increasingly important to operate optimal water quality monitoring networks, including the efficient selection and removal of water quality measurement stations. It is possible to reduce costs by grouping stations with similar water quality characteristics into clusters using cluster analysis, and then measuring the water quality at a representative point selected in each cluster.
Cluster analysis is a multivariate data analysis method that groups objects into several clusters by measuring the similarity between objects through distance and identifies the characteristics of each cluster [1]. Many studies have applied cluster analysis to water quality data. Aubert et al. [2] applied cluster analysis to multivariate time series water quality data collected from the Kervidy-Naizin watershed to identify flood patterns. Lyra et al. [3] applied cluster analysis to rainfall data in order to identify rainfall patterns by year and month. However, cluster analysis divides objects into clusters based on similarity or dissimilarity, and so the results vary depending on the type of distance measure.
The most commonly used approach is based on Euclidean distance. Emad et al. [4] evaluated the similarity of 11 sample points based on Euclidean distances and performed cluster analysis using 16 water quality variables measured monthly from 2008 to 2009 to evaluate the water quality of the Euphrates River. Azhar et al. [5] used six water quality variables collected from 1998 to 2007 to measure the similarity of nine points in the Muda River basin through Euclidean distance clustering. The Dynamic Time Warping (DTW) algorithm offers an alternative [6] that has been applied in numerous fields. For example, Sakoe et al. [7] applied the DTW algorithm to speech recognition as a method for measuring the similarity of speech data. Dürrenmatt et al. [8] applied DTW to water quality data (e.g., water temperature) from upstream and downstream sensors to measure the travel time and calculate velocity; from the results, they proposed ways to improve the monitoring of sewage flow rate.
Woo et al. [9] used DTW to compare model-predicted and observed water conductivity signals at 5 min intervals at four monitoring points with time shift and amplitude difference, using tracer study data provided by the Hillsborough County Water Resources Services (HCWRS) in Florida. They showed that DTW improves the alignment of the observed and model-predicted tracer signals over conventional methods. Dupas et al. [10] used DTW to identify seasonal variations in phosphate concentration during storm events. Because the lengths of high-frequency storm concentration time series may differ, it is difficult to calculate a distance between pairs of comparable points for clustering. They showed that a DTW-based K-means clustering algorithm was useful for identifying common patterns in water quality time series and for isolating unusual events.
Time series water quality data reported in the Water Information System of South Korea (WIS, http://water.nier.go.kr) have a time lag as water flows from upstream to downstream. Furthermore, the lengths of the data sets differ owing to sensor device failure at some stations. To determine the similarity of time series data, clustering of stations is commonly applied. The Euclidean algorithm is widely used because it is a simple method that calculates the sum of linear distances at the same time point to compare two time series. However, it is difficult to apply to two time series that have time lags or different lengths because it compares the same time points. Because the Euclidean distance aligns each point of one sequence with the same time point of the other sequence, it may yield low similarity when applied to data with a time delay. In contrast, the Dynamic Time Warping (DTW) algorithm compares two time series by changing the comparison time point, and so it is possible to compare data with time lags or with different lengths without loss. Furthermore, it has the advantage of being relatively robust against distortion and deformation in time. Therefore, DTW can measure similarity in a way that reflects the time delay caused by the distance between measurement points, and the result of cluster analysis based on it is expected to be more reasonable than that based on the Euclidean algorithm. Ouyang et al. [11] used DTW to calculate the similarity of hydrological time series because such series (for example, two flood sequences) often have approximately the same overall shape but are not aligned on the time axis.
This study aimed to measure similarities between water quality data measured at different water quality monitoring stations by using the DTW algorithm to perform cluster analysis; the results were compared with those clustered using the Euclidean algorithm. Both approaches were applied to data collected weekly along the Bukhan River by the Ministry of Environment of the Republic of Korea. The cluster results according to the similarity algorithm were then compared based on the characteristics and patterns of variables for each cluster.

2. Material and Methods

2.1. Study Site

The Bukhan River originates in Gangwon-do and, along with the Namhan River, is one of two major tributaries of the Han River. In this study, we used weekly water quality data of the Bukhan River from 2016 to 2018 obtained from the WIS. There are a total of 12 monitoring stations (Figure 1; Table 1), including 6 mainstream stations (Hwacheon, Chuncheon A, UiamDam, Cheongpyeong, Sambongli, and PaldangDam) and 6 tributary stations (Gapyeongcheon3(stream), Jojongcheon3(stream), Mukhyeoncheon(stream), Soyanggang2(river), Hongcheongang6(river), and Byeoggyecheon(stream)).

2.2. Data

In this study, nine water quality variables contained in the WIS were used: hydrogen ion concentration (pH), dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), suspended solids (SS), total nitrogen (TN), total phosphorus (TP), water temperature (Temp), and electrical conductivity (EC) (Table 2).
Table 3 shows the basic statistics and lengths of the data from the 12 stations. The Euclidean algorithm requires data of the same length for comparison, whereas the DTW algorithm can be used even if data lengths are different. Therefore, when using the Euclidean algorithm, only data collected at the same time at all stations were used for analysis. If data were missing at even one station, the data at that time point were removed for all stations. As a result, a total of 1680 data points were analyzed using the Euclidean algorithm, and 1814 data points were analyzed using the DTW algorithm.
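The removal rule described above (drop a time point for all stations if any station lacks it) can be illustrated with a short Python sketch. The station names and readings below are hypothetical toy values, not data from the WIS, and the function name `common_time_points` is introduced here only for illustration.

```python
def common_time_points(series_by_station):
    """Keep only the time points that have a value at every station."""
    shared = None
    for obs in series_by_station.values():
        # Time points with a non-missing value at this station.
        present = {t for t, v in obs.items() if v is not None}
        shared = present if shared is None else shared & present
    shared = sorted(shared or [])
    # Rebuild equal-length series restricted to the shared time points.
    return {name: [obs[t] for t in shared]
            for name, obs in series_by_station.items()}


# Hypothetical weekly readings; None marks a missing measurement.
toy = {
    "station_A": {1: 7.1, 2: 7.3, 3: None, 4: 7.0},
    "station_B": {1: 6.9, 2: None, 3: 7.2, 4: 7.4},
}
aligned = common_time_points(toy)
print(aligned)
```

In this toy case only weeks 1 and 4 survive, which shows how a single gap at one station shortens the series used for every station under a lock-step distance.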

Missing Data

This study used the Kalman replacement method, which replaces missing values based on the Kalman filter, to replace missing values in the data. The Kalman filter removes data noise and predicts future data by using historical and newly measured data in a dynamic linear model that changes over time.
The Kalman filter alternates between prediction and update steps to estimate the state variable. In the prediction step, the state variable at time t and the preliminary estimate of the error covariance are calculated using the state variable at time t−1; this is the prior estimate. Then, in the update step, the prior estimate is corrected using the Kalman gain and the observation at time t. The updated estimate is called the posterior estimate. Figure 2 visualizes the missing data values used in this study; the ratio of missing values for the whole dataset was 16.5%.
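The prediction and update steps can be sketched in Python for the simplest dynamic linear model, a local-level (random walk plus noise) model. This is a minimal illustration under stated assumptions, not the implementation used in the study: the model form, the variances `process_var` and `obs_var`, and the function name `kalman_fill` are all assumptions, and only a forward filter (no smoothing) is shown.

```python
def kalman_fill(values, process_var=1.0, obs_var=1.0):
    """Replace None entries with one-step-ahead Kalman predictions.

    Assumed model: state_t = state_{t-1} + noise(process_var),
                   y_t     = state_t     + noise(obs_var).
    """
    filled = []
    state, cov = None, None
    for y in values:
        if state is None:
            if y is None:
                filled.append(None)  # nothing observed yet; leave the gap
            else:
                state, cov = y, obs_var  # initialise from the first observation
                filled.append(y)
            continue
        # Prediction step: the random-walk prior keeps the mean,
        # while the error covariance grows by the process variance.
        prior_state, prior_cov = state, cov + process_var
        if y is None:
            # No observation at time t: the posterior equals the prior,
            # and the prior mean is used as the replacement value.
            state, cov = prior_state, prior_cov
            filled.append(prior_state)
        else:
            # Update step: correct the prior with the Kalman gain.
            gain = prior_cov / (prior_cov + obs_var)
            state = prior_state + gain * (y - prior_state)
            cov = (1.0 - gain) * prior_cov
            filled.append(y)  # observed values are kept unchanged
    return filled


filled = kalman_fill([1.0, 3.0, None, 2.0])
print(filled)
```

Observed values pass through unchanged; only the gap is filled, here with the filtered state carried forward from the two preceding observations.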

2.3. Dynamic Time Warping

This study conducted K-medoids cluster analysis of the water quality network data using the Euclidean algorithm and the DTW algorithm, with the ‘dtw’ package in the statistical program R (ver 3.6.1).
The Euclidean algorithm compares two time series one-to-one at the same time points; therefore, the lengths of the two time series must be the same. If the lengths of the time series being compared differ, they must be made equal, which inevitably leads to loss of information. In addition, if the signal at one station lags that at another (e.g., because of the distance between the stations), the measured similarity between the two time series is low even if they have the same shape.
In contrast, the DTW algorithm matches time series in a direction that minimizes the distance between them, allowing data at different points in time to be compared. Thus, time series of different lengths can be compared through the DTW algorithm without loss of data. Furthermore, the DTW algorithm has the advantage of being relatively robust to distortion and deformation in time. Figure 3 shows a conceptual plot of the Euclidean and DTW algorithms, assuming that there are two time series Q and R with different lengths m and n. ‘E’ and ‘W’ are lines showing the mapping between the two points that each method compares. Since the Euclidean distance only compares data at the same time point, ‘E’ is a vertical line. On the other hand, since DTW can also compare data at different time points, ‘W’ is not necessarily a vertical line.
Assuming two time series Q and R with lengths m and n, respectively (Figure 4a), the DTW algorithm proceeds as follows:
Step 1 (Figure 4b): Create local cost matrix C, also called the local distance matrix, by using the local cost function c, which represents the Euclidean distance between two points (q,r). For two time series Q and R, the local distance function c and Cost matrix C are defined as follows:
$$C(i,j) = c(q_i, r_j) = (q_i - r_j)^2$$
where c is the distance between any two points of time series Q and R, and i and j are the indices of the i-th point of Q and the j-th point of R. The distance function c takes a smaller value if the points being compared are similar, and a larger value if they are different. If the DTW algorithm is applied to multivariate time series data with V variables, the local cost is obtained by calculating the local distance for each variable and summing over the variables as follows:
$$C(i,j) = \sum_{v=1}^{V} c(q_{iv}, r_{jv}) = \sum_{v=1}^{V} (q_{iv} - r_{jv})^2$$
Step 2 (Figure 4c): Create a global cost matrix M. First, the first row and first column of matrix M are calculated as follows:
$$M(i,j) = \begin{cases} C(1,j) + M(1,j-1), & i = 1,\ j > 1 \\ C(i,1) + M(i-1,1), & j = 1,\ i > 1 \end{cases}$$
At this time, M(1,1) = C(1,1). Then, calculate the rest of the matrix as follows:
$$M(i,j) = C(i,j) + \min\left[ M(i-1,j-1),\ M(i-1,j),\ M(i,j-1) \right]$$
Step 3 (Figure 4d): Find the optimal warping path in the global cost matrix M that meets the constraints described by Sakoe et al. [7]. Many warping paths satisfy these constraints. The DTW algorithm takes the path along which the sum of the distances between the two time series is minimal as the optimal warping path. The mapping of the two time series through the warping path is shown in Figure 4e.
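Steps 1 to 3 can be sketched in plain Python as follows. This is a minimal illustration rather than the ‘dtw’ package implementation: it assumes the squared difference as the local cost, uses only the three-neighbour recursion given above with no additional window or slope constraints, and uses 0-based indices. The function names and the toy series are introduced here for illustration only.

```python
def local_cost(q, r):
    """Step 1: local cost matrix C; points may be scalars or tuples of V variables."""
    def as_vec(p):
        return p if isinstance(p, (list, tuple)) else (p,)
    return [[sum((a - b) ** 2 for a, b in zip(as_vec(qi), as_vec(rj)))
             for rj in r] for qi in q]


def dtw(q, r):
    """Return (DTW distance, warping path as a list of (i, j) index pairs)."""
    C = local_cost(q, r)
    m, n = len(q), len(r)
    # Step 2: global cost matrix M, starting with the first row and column.
    M = [[0.0] * n for _ in range(m)]
    M[0][0] = C[0][0]
    for j in range(1, n):
        M[0][j] = C[0][j] + M[0][j - 1]
    for i in range(1, m):
        M[i][0] = C[i][0] + M[i - 1][0]
    for i in range(1, m):
        for j in range(1, n):
            M[i][j] = C[i][j] + min(M[i - 1][j - 1], M[i - 1][j], M[i][j - 1])
    # Step 3: backtrack from the last cell to the first along minimal costs.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        if i == 0:
            j -= 1
        elif j == 0:
            i -= 1
        else:
            _, i, j = min((M[i - 1][j - 1], i - 1, j - 1),
                          (M[i - 1][j], i - 1, j),
                          (M[i][j - 1], i, j - 1))
        path.append((i, j))
    path.reverse()
    return M[m - 1][n - 1], path


# Toy series: r is q delayed by one step, so the lengths differ.
q = [0, 1, 2, 1, 0]
r = [0, 0, 1, 2, 1, 0]
dist, path = dtw(q, r)
lockstep = sum((a - b) ** 2 for a, b in zip(q, r))  # same-index comparison
print(dist, lockstep, path)
```

For this toy pair the warped distance is zero because the path absorbs the one-step delay, whereas the same-index comparison over the overlapping points gives a non-zero cost even though the two shapes are identical.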

2.4. Clustering Method

Clustering is a multivariate data analysis method that groups objects into several clusters based on similarities between objects through distance measurements and identifies the characteristics of each cluster. At this time, objects with high similarity share the same cluster and those with low similarity have different clusters. In other words, clusters are formed so that the variance of data within a cluster is minimal and the variance between clusters is maximal. Cluster analysis includes various methods such as K-means clustering, hierarchical clustering, K-medoids clustering, the Fuzzy algorithm, etc., and uses only given data without prior information about the data.
The K-medoids algorithm used in this study was proposed by Kaufman et al. [12] and forms clusters around medoids, where a medoid is a representative object located at the center of a cluster. The K-medoids algorithm is also called the Partitioning Around Medoids (PAM) algorithm, and the process of forming clusters is as follows. Initially, k medoids are selected at random and the remaining objects are assigned to the cluster of the nearest medoid. Then, for each cluster, the object with the smallest mean distance to all other objects in the cluster is set as the new medoid. The remaining objects are again assigned to the cluster of the nearest medoid. This process is repeated until the new medoids and the existing medoids are identical, and the clusters at the end of the iterative process are taken as the final K-medoids clusters. The K-medoids algorithm thus repeatedly replaces a medoid with a non-medoid object, minimizing the distance between the objects forming the same cluster. This method represents an improvement on the K-means algorithm, which is greatly affected by outliers; it is less sensitive to outliers because it does not use the average as the central object. PAM cluster analysis was performed using the ‘dtwclust’ package in R, and since the units of the water quality variables all differ, the distance calculation and cluster analysis were performed after normalization of each variable.
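The iterative procedure described above can be sketched in Python on a precomputed distance matrix (for example, pairwise DTW distances between stations). This is a simplified illustration of the assign-and-update loop as described in the text, not the ‘dtwclust’ implementation; the function name `k_medoids`, the `init` and `seed` parameters, and the toy distances are assumptions made for the example.

```python
import random


def k_medoids(dist, k, init=None, seed=0, max_iter=100):
    """Cluster n objects given an n x n distance matrix; return (medoids, labels)."""
    n = len(dist)
    # Random initial medoids, unless explicit ones are supplied.
    medoids = list(init) if init is not None else random.Random(seed).sample(range(n), k)

    def assign(meds):
        # Each object joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # guard against an empty cluster
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda cand: sum(dist[cand][j] for j in members)))
        if new_medoids == medoids:  # converged: medoids no longer change
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy distances between five objects placed on a line (hypothetical values).
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, labels = k_medoids(dist, k=2, init=[0, 1])
print(medoids, labels)
```

Because the algorithm only needs pairwise distances, the same loop works unchanged whether the matrix comes from the Euclidean or the DTW algorithm.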

2.5. Clustering Validation Index

The Clustering Validation Index (CVI) was used to determine the optimal number of clusters [13,14]. A CVI is an indicator of how well clusters are formed. In this study, six internal CVIs were used; they were calculated based solely on the data and the cluster results. The CVIs were the Silhouette (Sil) proposed by Rousseeuw [15], the Calinski-Harabasz (CH) proposed by Caliński et al. [16], the Dunn (D) proposed by Dunn [17], the Davies Bouldin (DB) proposed by Davies et al. [18], the COP index proposed by Gurrutxaga et al. [19], and the Modified Davies Bouldin (MDB) proposed by Kim et al. [20]. Larger values of Sil, CH, and D and smaller values of DB, MDB, and COP indicate better-formed clusters. The CVIs were calculated using the ‘dtwclust’ package in R.
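As an illustration of how an internal CVI is computed from the data and the cluster result alone, the following Python sketch calculates the mean Silhouette from a distance matrix and cluster labels. It follows the usual definition of the Silhouette, for which values closer to 1 indicate compact, well-separated clusters; it is not the ‘dtwclust’ code, it assumes at least two clusters, and the toy distances are hypothetical.

```python
def mean_silhouette(dist, labels):
    """Mean Silhouette over all objects, given a distance matrix and labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes a Silhouette of 0
        # a: mean distance to the other members of the object's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Toy distances between five objects placed on a line (hypothetical values).
positions = [0, 1, 2, 10, 11]
dist = [[abs(p - s) for s in positions] for p in positions]
good = mean_silhouette(dist, [0, 0, 0, 1, 1])  # grouping that matches the gaps
bad = mean_silhouette(dist, [0, 1, 0, 1, 0])   # grouping that ignores them
print(good, bad)
```

The grouping that respects the gap between the two sets of points scores clearly higher than the interleaved grouping, which is the behaviour used to compare candidate numbers of clusters.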

3. Results

3.1. Optimization CVI

The number of clusters was set in advance from two to five, and the optimal number of clusters was determined based on the CVI (Table 4). Typically, many CVIs are utilized and compared to each other, and a majority vote can be used to determine the final outcome [13,14]. Therefore, the optimal number of clusters was found to be five for both algorithms. Missing data values were replaced using Kalman filters, and standardized data were used for analysis.
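The majority vote over several CVIs can be sketched in Python as follows. The index values below are made-up toy numbers for illustration only and are not the values in Table 4; the optimization direction assigned to each index follows the usual definitions and is stated here as an assumption, and the function name `vote_for_k` is hypothetical.

```python
from collections import Counter

# Assumed direction of each index: True means larger is better.
HIGHER_IS_BETTER = {"Sil": True, "CH": True, "D": True,
                    "DB": False, "MDB": False, "COP": False}


def vote_for_k(cvi_table):
    """cvi_table maps index name -> {k: value}; return (winning k, choice per index)."""
    choices = {}
    for name, by_k in cvi_table.items():
        pick = max if HIGHER_IS_BETTER[name] else min
        choices[name] = pick(by_k, key=by_k.get)  # k preferred by this index
    winner = Counter(choices.values()).most_common(1)[0][0]
    return winner, choices


# Made-up toy values, NOT the study's results.
toy_cvi = {
    "Sil": {2: 0.30, 3: 0.41, 4: 0.38},
    "CH": {2: 5.1, 3: 6.0, 4: 6.4},
    "DB": {2: 1.2, 3: 0.9, 4: 1.0},
}
best_k, choices = vote_for_k(toy_cvi)
print(best_k, choices)
```

In the toy table two of the three indices prefer three clusters, so the vote returns three; ties would need an explicit rule, which this sketch does not define.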

3.2. Comparison of the Euclidean and Dynamic Time Warping Algorithms

Water quality measuring stations were divided into five clusters as shown in Figure 5. Using the Euclidean algorithm, Mukhyeoncheon formed a cluster alone; Sambongli, UiamDam, Chuncheon A, and PaldangDam formed a cluster; Gapyeongcheon3, Jojongcheon3, and Cheongpyeong formed a cluster; Soyanggang2 and Hwacheon formed a cluster; and finally, Byeoggyecheon and Hongcheongang6 formed a cluster. However, when the DTW algorithm was used, Gapyeongcheon3 and Jojongcheon3 were separated from Cheongpyeong, while UiamDam was separated from the other measuring stations in the mainstream. Cheongpyeong and Hwacheon formed a cluster with Soyanggang2, Chuncheon A, and PaldangDam. As with the Euclidean algorithm, Mukhyeoncheon formed a cluster alone. Unlike with the Euclidean algorithm, UiamDam also formed a cluster alone.
For the clusters formed with both algorithms, there were statistically significant differences in cluster means for all variables. To examine the mean differences between clusters in detail, Tukey’s test was used at a significance level of 0.05 (Table 5). Where statistically significant differences were found, the groups were labeled differently; a, b, c, and d denote the groups in order from the highest to the lowest.
When using the Euclidean algorithm, Cluster 1 (Sambongli, UiamDam, Chuncheon A, and PaldangDam) had lower average values of pH and DO. In contrast, Cluster 4 (Byeoggyecheon and Hongcheongang6) had high average values of pH and DO. Cluster 2 (Gapyeongcheon3, Jojongcheon3, and Cheongpyeong) had high pH, DO, and Temp. Cluster 3 (Soyanggang2 and Hwacheon) showed high average values of DO and low average values of BOD, COD, SS, TN, TP, Temp, and EC. Cluster 5 (Mukhyeoncheon) had low pH and DO and high values for other water quality variables; among them, BOD, COD, SS, TN, TP, and EC showed a very significant difference from other clusters.
When using the DTW algorithm, Cluster 1 (Soyanggang2 and the mainstream except UiamDam) showed low TN and Temp. Cluster 2 (Gapyeongcheon3 and Jojongcheon3) showed high pH, DO, and Temp, and low COD. Cluster 3 (UiamDam) showed high pH and low TN, Temp, and EC. Cluster 4 (Byeoggyecheon and Hongcheongang6) had high pH and DO, and low BOD, COD, and Temp averages. Cluster 5 (Mukhyeoncheon) showed the same characteristics as that formed using the Euclidean algorithm. Table 6 shows cluster-specific variable characteristics for each algorithm.
Both the Euclidean and DTW algorithms included Byeoggyecheon and Hongcheongang6 in the same cluster, but the water quality characteristics of the clusters differed. This difference reflects the need for data removal when using the Euclidean algorithm, and shows that data distortion is inevitable when the time series being compared have different lengths.

3.3. Comparison of Water Quality Characteristics for Each Cluster

Figure 6 and Figure 7 show cluster-specific boxplots for each variable when the 12 stations are divided into 5 clusters using K-medoids cluster analysis with the Euclidean and DTW algorithms, respectively. When using the Euclidean algorithm (Figure 6), Mukhyeoncheon (Cluster 5) had greater values and deviations than the streams belonging to other clusters in terms of BOD, COD, TN, TP, and EC. The result is similar using the DTW algorithm (Figure 7).
Figure 8 and Figure 9 show cluster-specific time series plots for each variable when the 12 stations were divided into 5 clusters through K-medoids cluster analysis using the Euclidean and DTW algorithms, respectively. In both cases, Cluster 5 (Mukhyeoncheon) showed unusual water parameter patterns compared with the other streams. Regardless of cluster, water temperature was similar for most rivers, although Soyanggang2 had unusually low water temperatures compared with the other streams.

4. Discussion

4.1. Comparison of Clustering and Water Quality Patterns

The Euclidean algorithm, which aligns the i-th point of one sequence with the i-th point of the other sequence, can yield low similarity for series that are shifted in time. Because of this, DTW, which allows nonlinear alignment, is often used instead of the Euclidean algorithm in various fields. In hydrology, Ouyang et al. [11] used the DTW algorithm instead of the Euclidean algorithm for similarity search and pattern discovery in hydrologic time series data. Chotirat et al. [20] classified time series data obtained from video data with a DTW-based model and compared its performance with that of a Euclidean-based model.
Accordingly, this study applied the Euclidean and DTW distance algorithms to water quality data to determine similarities in water quality among monitoring stations and to identify the characteristics of water quality variables by cluster.
The Euclidean method clustered stations from the mainstream, left tributary, and right tributary together. In contrast, DTW formed three clusters that generally reflected the mainstream, left tributary, and right tributary (except for Soyanggang2, UiamDam, and Mukhyeoncheon). As such, the DTW approach better reflected the regional characteristics of the watersheds and hydraulic environments. Both algorithms showed statistically significant mean differences across clusters in all variables, and both clustered Byeoggyecheon and Hongcheongang6 together. However, the water quality characteristics of the clusters differed, highlighting the impact of unavoidable data removal when using the Euclidean algorithm, resulting in a distortion of water quality characteristics.
Cluster 1 of the DTW classification, representing mainstream stations, shows relatively better water quality than Cluster 2 (the left tributary) and Cluster 4 (the right tributary). However, while the Soyanggang2 monitoring station belongs to the left-hand tributary, it was classified into the mainstream cluster. The reason for this is that the measuring station is located directly downstream of the Soyanggang Dam, a large-scale dam with a storage capacity of 29 million m3. The dam has relatively good water quality because it is located in a water resource protection zone and is used as a source of drinking water. As such, the water quality at the Soyanggang2 monitoring station is relatively better than that at the monitoring stations in Cluster 2 (i.e., the other stations on the left-hand tributary), and so it was classified into the mainstream cluster, for which stations show better water quality.
Despite being a right-hand tributary, Mukhyeoncheon was classified as a separate cluster. The concentrations of BOD and COD (representative of pollution by organic matter), and of TN and TP (representative of pollution by nutrients) averaged 2.38 mg/L, 6.17 mg/L, 3.34 mg/L, and 0.1 mg/L, respectively. This demonstrates markedly higher levels of contamination than other stations, particularly those in Cluster 4 (the other right-hand tributaries).
UiamDam belongs to the mainstream, but its BOD, COD, TN, TP, and SS values are all greater than those of other stations in DTW Cluster 1 (i.e., other mainstream stations). In particular, the average BOD was 1.31 mg/L, which is significantly higher than the average of the stations corresponding to DTW Cluster 1.
The Bukhan River Basin has a low population density and few industrial facilities. However, the Mukhyeoncheon and UiamDam stations lie along the densely populated cities in the basin and are affected by various urban streams. We believe that various point and non-point pollutants from these urban areas affect the water quality at these stations, which explains why they are not clustered with other stations from the same geographical area.
Long-term observational data obtained from on-site monitoring networks are critical for the proper management of water quality and ecosystems. However, the operation of on-site monitoring stations is not always possible due to limited budget. The DTW analysis in this study provides useful information for understanding the similarity or difference in water parameter values between different locations. Thus, the number and location of required monitoring stations can be adjusted to improve the efficiency of field monitoring network management.

4.2. Limits and Future Work

The main limitation of this study was the low number of samples, which reflects the relatively small scale of the Bukhan River water system. Because the calculation time of DTW increases as the data volume increases, using a limited number of samples was necessary. However, in the future, the analysis will be expanded to include more rivers. In addition, the water quality data used in this study included missing values owing to the failure of the measuring sensors and/or human error. If missing values significantly change the trend of a time series, the reliability of the analysis results may be lowered.
In future studies, it may be possible to add variables such as chlorophyll-a and fecal coliforms, which were not used in the analysis owing to high rates of missing data, or to select proxy variables that reflect the characteristics of water quality variables. However, if the sources of stream pollution continue to increase, there is a limit to the efficacy of monitoring and improving water quality through methods that rely on general concentration regulation. To this end, the Ministry of Environment of South Korea is introducing and implementing a “Total Water Pollution Load Management System” that regulates the amount of pollution and reflects the emission of pollutants. The pollutant load data (including flow rate) were not utilized in this study because measurement frequency and measurement points remain limited. However, if sufficient data are secured in the future, it would be possible to add pollutant load data to reflect the amount of pollutant discharge. We also plan to compare the results of dynamic PCA with those of DTW.

5. Conclusions

Proper network installation and removal is an important part of water quality monitoring and network operation efficiency. To reduce the time and cost required to secure and monitor water quality data at locations where measurement is difficult, cluster analysis based on calculated similarity and dissimilarity between measuring stations can be used. Cluster analysis forms clusters based on the similarity measured according to distance, and so cluster results may vary depending on the type of distance. This study clustered water quality measuring stations of the Bukhan River water system using the K-medoids cluster analysis based on both the Euclidean and DTW algorithms.
The Euclidean algorithm compares the same time points of two time series and is limited by the fact that the lengths of the two time series must be the same. In contrast, the DTW algorithm compares time series while changing the time point and can be used even if the lengths of the two time series are different. In water quality measurement network data, there is a time lag as water flows from upstream to downstream, and the length of the data may differ between measuring stations owing to failures of the measuring device, etc. Therefore, when clustering water quality data from a measurement network, it is preferable to use the DTW algorithm.
Our results show that the Euclidean algorithm formed clusters by mixing mainstream and tributary stations; the mainstream stations were largely divided into three different clusters. In contrast, the DTW algorithm formed clear clusters by reflecting the characteristics of water quality and watershed. Furthermore, because the Euclidean algorithm requires the lengths of the time series to be the same, data loss was inevitable. As a result, even where clusters were the same as those obtained by DTW, the characteristics of the water quality variables in the cluster differed.

Author Contributions

Data curation, S.L.; software, J.K.; investigation E.L.; formal analysis, J.H.; funding acquisition, T.-Y.H. and K.-J.L.; supervision, T.-Y.H.; writing—original draft, S.L.; writing—review & editing, E.L., J.O., J.P., J.H. and K.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2017R1D1A3B03028084, 2019R1I1A3A01057696). The work was also supported by a grant from the National Institute of Environment Research (NIER), funded by the Ministry of Environment (MOE) of the Republic of Korea (NIER-2018-01-01-064).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Driver, H.E.; Kroeber, A.L. Quantitative Expression of Cultural Relationships. In American Archeology and Ethnology; University of California Press: Berkeley, CA, USA, 1932; Volume 31, pp. 211–256.
  2. Aubert, A.H.; Tavenard, R.; Emonet, R.; de Lavenne, A.; Malinowski, S.; Guyet, T.; Quiniou, R.; Odobez, J.-M.; Merot, P.; Gascuel-Odoux, C. Clustering flood events from water quality time series using Latent Dirichlet Allocation model. Water Resour. Res. 2013, 49, 8187–8199.
  3. Lyra, G.B.; Oliveira-Júnior, J.F.; Zeri, M. Cluster analysis applied to the spatial and temporal variability of monthly rainfall in Alagoas state, Northeast of Brazil. Int. J. Climatol. 2014, 34, 3546–3558.
  4. Emad, A.M.S.A.-H.; Ahmed, M.T.; Eethar, M.A.-O. Assessment of water quality of Euphrates River using cluster analysis. J. Environ. Prot. 2012, 3, 1629–1633.
  5. Azhar, S.C.; Aris, A.Z.; Yusoff, M.K.; Ramli, M.F.; Juahir, H. Classification of river water quality using multivariate analysis. Procedia Environ. Sci. 2015, 30, 79–84.
  6. Bellman, R.; Kalaba, R. On adaptive control processes. IRE Trans. Autom. Control 1959, 4, 1–9.
  7. Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 43–49.
  8. Dürrenmatt, D.J.; Del Giudice, D.; Rieckermann, J. Dynamic time warping improves sewer flow monitoring. Water Res. 2013, 47, 3803–3816.
  9. Woo, H.; Boccelli, D.L.; Uber, J.G.; Janke, R.; Su, Y. Dynamic time warping for quantitative analysis of tracer study time-series water quality data. J. Water Res. Plan. Manag. 2019, 145, 04019052.
  10. Dupas, R.; Tavenard, R.; Fovet, O.; Gilliet, N.; Grimaldi, C.; Gascuel-Odoux, C. Identifying seasonal patterns of phosphorus storm dynamics with dynamic time warping. Water Resour. Res. 2015, 51, 8868–8882.
  11. Ouyang, R.; Ren, L.; Cheng, W.; Zhou, C. Similarity search and pattern discovery in hydrological time series data mining. Hydrol. Process. Int. J. 2010, 24, 1198–1210.
  12. Kaufman, L.; Rousseeuw, P.J. References. In Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1990; ISBN 978-0-47031680-1.
  13. Sardá-Espinosa, A. Comparing time-series clustering algorithms in R using the dtwclust package. R package vignette. Pattern Recognit. 2017, 12, 41.
  14. Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Pérez, J.M.; Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013, 46, 243–256.
  15. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
  16. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. Theory Methods 1974, 3, 1–27.
  17. Dunn, J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974, 4, 95–104.
  18. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 224–227.
  19. Gurrutxaga, I.; Albisua, I.; Arbelaitz, O.; Martín, J.I.; Muguerza, J.; Pérez, J.M.; Perona, I. SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index. Pattern Recognit. 2010, 43, 3364–3373.
  20. Kim, M.; Ramakrishna, R.S. New indices for cluster validity assessment. Pattern Recognit. Lett. 2005, 26, 2353–2363.
Figure 1. Bukhan River drainage system: (a) map-based drainage system schematic and (b) schematic diagram of the drainage system.
Figure 2. Map of missing data.
Figure 3. Aligning rules for the (a) Euclidean (E) algorithm and (b) Dynamic Time Warping (DTW) algorithm. ‘E’ and ‘W’: lines showing mapping between two points that each methodology compares.
Figure 4. Illustration of the global cost matrix and warping path of the classic Dynamic Time Warping (DTW) algorithm: (a) two time series; (b) local cost matrix; (c) global cost matrix; (d) optimal warping path; (e) DTW alignment.
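The steps named in the Figure 4 caption (local cost, global cost matrix, optimal warping path) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `dtw_distance`, the absolute-difference local cost, and the absence of any window constraint are all assumptions made for the example.

```python
import math


def dtw_distance(x, y):
    """Classic DTW sketch (assumed local cost: absolute difference; no window)."""
    n, m = len(x), len(y)
    # acc[i][j] holds the minimum accumulated cost of aligning x[:i] with y[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # local cost of pairing x[i-1] with y[j-1]
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the last cell to recover one optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # advance in x only
            (acc[i][j - 1], i, j - 1),          # advance in y only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return acc[n][m], path


# Two made-up series with the same shape, one lagged by a single time step.
x = [0, 0, 1, 2, 1, 0]
y = [0, 1, 2, 1, 0, 0]
lockstep = sum(abs(a - b) for a, b in zip(x, y))  # same-time-point comparison
warped, path = dtw_distance(x, y)
print(lockstep, warped)  # lockstep cost is 4, DTW cost is 0.0
```

The toy series differ only by a time lag, so the same-time-point comparison reports a nonzero cost while the warped alignment reports none. Because the accumulated cost matrix is indexed by both lengths, the two series do not need to be equally long.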
Figure 5. Cluster formation using the (a) Euclidean algorithm and (b) Dynamic Time Warping (DTW) algorithm.
Figure 6. Cluster-specific boxplots for each variable under the Euclidean algorithm: (a) pH, (b) DO, (c) BOD, (d) COD, (e) SS, (f) TN, (g) TP, (h) Temp, and (i) EC.
Figure 7. Cluster-specific boxplots for each variable under the Dynamic Time Warping (DTW) algorithm: (a) pH, (b) DO, (c) BOD, (d) COD, (e) SS, (f) TN, (g) TP, (h) Temp, and (i) EC.
Figure 8. Cluster-specific time series plots for each variable observed between 2016 and 2018 under the Euclidean algorithm: Cluster 1 (1), Cluster 2 (2), Cluster 3 (3), Cluster 4 (4), and Cluster 5 (5).
Figure 9. Cluster-specific time series plots for each variable observed between 2016 and 2018 under the Dynamic Time Warping (DTW) algorithm: Cluster 1 (1), Cluster 2 (2), Cluster 3 (3), Cluster 4 (4), and Cluster 5 (5).
Table 1. Abbreviations of water quality monitoring stations of the Bukhan River.

| Group | Station | Abbreviation |
|---|---|---|
| Mainstream (MS) | Hwacheon | HC (MS1) |
| Mainstream (MS) | Chuncheon A | CA (MS2) |
| Mainstream (MS) | UiamDam | UD (MS3) |
| Mainstream (MS) | Cheongpyeong | CP (MS4) |
| Mainstream (MS) | Sambongli | SB (MS5) |
| Mainstream (MS) | PaldangDam | PD (MS6) |
| Right tributary (RT) | Gapyeongcheon3 (stream) | GP (RT1) |
| Right tributary (RT) | Jojongcheon3 (stream) | JJ (RT2) |
| Right tributary (RT) | Mukhyeoncheon (stream) | MH (RT3) |
| Left tributary (LT) | Soyanggang2 (river) | SY (LT1) |
| Left tributary (LT) | Hongcheongang6 (river) | HG (LT2) |
| Left tributary (LT) | Byeoggyecheon (stream) | BG (LT3) |
Table 2. Characteristics of variables measured for the Bukhan River.

| Variable | Description | Unit | Mean | SD |
|---|---|---|---|---|
| pH | Hydrogen Ion Concentration | - | 8.05 | 0.45 |
| DO | Dissolved Oxygen | mg/L | 11.04 | 2.06 |
| BOD | Biochemical Oxygen Demand | mg/L | 0.98 | 0.92 |
| COD | Chemical Oxygen Demand | mg/L | 3.30 | 1.32 |
| SS | Suspended Solid | mg/L | 4.64 | 14.42 |
| TN | Total Nitrogen | mg/L | 2.46 | 1.99 |
| TP | Total Phosphorus | mg/L | 0.03 | 0.06 |
| Temp | Temperature | °C | 14.96 | 7.54 |
| EC | Electrical Conductivity | μmhos/cm | 152.76 | 86.24 |
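Table 3. Characteristics of variables recorded by monitoring stations of the Bukhan River.

Mainstream

| Variable | Hwacheon Mean | SD | N | Chuncheon A Mean | SD | N | UiamDam Mean | SD | N |
|---|---|---|---|---|---|---|---|---|---|
| pH | 8.01 | 0.32 | 149 | 7.83 | 0.25 | 149 | 8.17 | 0.78 | 150 |
| DO | 11.10 | 1.78 | 149 | 10.43 | 2.10 | 149 | 11.60 | 2.68 | 150 |
| BOD | 0.53 | 0.25 | 149 | 0.59 | 0.17 | 149 | 1.31 | 0.55 | 150 |
| COD | 2.55 | 0.48 | 149 | 2.74 | 0.41 | 149 | 3.16 | 0.70 | 150 |
| SS | 1.92 | 1.59 | 149 | 2.56 | 1.99 | 149 | 4.05 | 9.37 | 150 |
| TN | 1.29 | 0.36 | 149 | 1.31 | 0.33 | 149 | 1.89 | 0.49 | 150 |
| TP | 0.01 | 0.01 | 149 | 0.01 | 0.02 | 149 | 0.02 | 0.02 | 150 |
| Temp | 13.91 | 6.20 | 149 | 13.71 | 5.89 | 149 | 12.89 | 6.64 | 150 |
| EC | 113.31 | 19.73 | 149 | 106.13 | 18.66 | 149 | 101.83 | 14.69 | 150 |

| Variable | Cheongpyeong Mean | SD | N | Sambongli Mean | SD | N | PaldangDam Mean | SD | N |
|---|---|---|---|---|---|---|---|---|---|
| pH | 8.11 | 0.32 | 149 | 7.79 | 0.44 | 155 | 7.82 | 0.42 | 155 |
| DO | 10.86 | 1.84 | 149 | 10.40 | 1.74 | 155 | 10.10 | 2.40 | 155 |
| BOD | 1.00 | 0.49 | 149 | 0.90 | 0.29 | 155 | 1.16 | 0.38 | 155 |
| COD | 3.50 | 0.66 | 149 | 3.46 | 0.49 | 155 | 3.79 | 0.52 | 155 |
| SS | 3.91 | 4.66 | 149 | 3.37 | 4.39 | 155 | 5.16 | 3.52 | 155 |
| TN | 1.87 | 0.38 | 149 | 1.88 | 0.33 | 155 | 2.20 | 0.41 | 155 |
| TP | 0.02 | 0.02 | 149 | 0.02 | 0.01 | 155 | 0.03 | 0.02 | 155 |
| Temp | 16.58 | 7.42 | 149 | 15.64 | 7.69 | 155 | 13.78 | 7.72 | 155 |
| EC | 112.41 | 16.06 | 149 | 127.52 | 27.20 | 155 | 198.62 | 37.69 | 155 |

Right tributary

| Variable | Gapyeongcheon3 (stream) Mean | SD | N | Jojongcheon3 (stream) Mean | SD | N | Mukhyeoncheon (stream) Mean | SD | N |
|---|---|---|---|---|---|---|---|---|---|
| pH | 8.15 | 0.34 | 149 | 8.31 | 0.44 | 149 | 7.82 | 0.30 | 153 |
| DO | 11.37 | 1.90 | 149 | 11.56 | 2.03 | 149 | 10.34 | 1.34 | 153 |
| BOD | 0.66 | 0.35 | 149 | 0.94 | 0.40 | 149 | 2.38 | 1.75 | 153 |
| COD | 2.62 | 1.63 | 149 | 3.15 | 0.88 | 149 | 6.17 | 1.94 | 153 |
| SS | 4.32 | 21.12 | 149 | 5.74 | 8.00 | 149 | 12.08 | 38.83 | 153 |
| TN | 1.90 | 0.47 | 149 | 2.60 | 0.81 | 149 | 3.34 | 3.50 | 153 |
| TP | 0.01 | 0.03 | 149 | 0.03 | 0.02 | 149 | 0.10 | 0.13 | 153 |
| Temp | 17.36 | 8.37 | 149 | 17.61 | 8.62 | 149 | 18.29 | 6.11 | 153 |
| EC | 108.94 | 28.74 | 149 | 182.71 | 36.91 | 149 | 406.71 | 91.45 | 153 |

Left tributary

| Variable | Soyanggang2 (river) Mean | SD | N | Hongcheongang6 (river) Mean | SD | N | Byeoggyecheon (stream) Mean | SD | N |
|---|---|---|---|---|---|---|---|---|---|
| pH | 8.00 | 0.40 | 154 | 8.21 | 0.24 | 149 | 8.28 | 0.33 | 153 |
| DO | 12.14 | 1.42 | 154 | 10.34 | 2.00 | 149 | 11.37 | 1.94 | 153 |
| BOD | 0.36 | 0.12 | 154 | 0.62 | 0.31 | 149 | 0.61 | 0.41 | 153 |
| COD | 2.78 | 0.41 | 154 | 3.00 | 0.74 | 149 | 2.78 | 1.12 | 153 |
| SS | 1.70 | 1.86 | 154 | 2.94 | 5.91 | 149 | 3.22 | 5.34 | 153 |
| TN | 1.59 | 0.15 | 154 | 2.65 | 0.74 | 149 | 1.79 | 0.55 | 153 |
| TP | 0.01 | 0.01 | 154 | 0.02 | 0.02 | 149 | 0.02 | 0.03 | 153 |
| Temp | 9.50 | 3.14 | 154 | 16.92 | 8.12 | 149 | 15.10 | 8.27 | 153 |
| EC | 79.88 | 8.13 | 154 | 180.59 | 39.28 | 149 | 118.48 | 31.62 | 153 |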
Table 4. Clustering Validation Index (CVI) for the Euclidean and Dynamic Time Warping (DTW) algorithms. Shading represents the largest value (Sil, CH, D) or the smallest value (DB, MDB, COP) of an index.

| Algorithm | # of clusters | Sil | CH | DB | MDB | D | COP |
|---|---|---|---|---|---|---|---|
| Euclidean | 2 | 0.160 | 12.436 | 1.257 | 1.257 | 0.701 | 0.668 |
| Euclidean | 3 | 0.140 | 6.578 | 1.245 | 1.245 | 0.776 | 0.532 |
| Euclidean | 4 | 0.131 | 4.430 | 1.054 | 1.091 | 0.776 | 0.464 |
| Euclidean | 5 | 0.050 | 3.384 | 0.939 | 1.066 | 0.659 | 0.423 |
| DTW | 2 | 0.106 | 7.649 | 1.493 | 1.493 | 0.763 | 0.681 |
| DTW | 3 | 0.128 | 3.981 | 1.338 | 1.353 | 0.824 | 0.594 |
| DTW | 4 | 0.130 | 2.737 | 1.116 | 1.144 | 0.824 | 0.514 |
| DTW | 5 | 0.044 | 3.064 | 1.028 | 1.135 | 0.624 | 0.476 |
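As an illustration of how one of the indices in Table 4 can be obtained from a pairwise distance matrix, the sketch below computes the mean silhouette width (the Sil column, after Rousseeuw [15]). It is a hedged example rather than the code used in the study: the function name `mean_silhouette`, the toy distances, and the labels are hypothetical, and a score of 0 for single-member clusters is the convention assumed here.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette width from a symmetric distance matrix (needs >= 2 clusters)."""
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumed convention: singleton clusters score 0
            continue
        # a: mean distance from i to the other members of its own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance from i to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Hypothetical example: four "stations" placed on a line at 0, 1, 10 and 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(dist, [0, 0, 1, 1])  # compact, well-separated clusters
bad = mean_silhouette(dist, [0, 1, 0, 1])   # clusters that mix far-apart points
print(good, bad)
```

Larger values indicate more compact and better separated clusters. Because the function only needs a distance matrix, the same code applies whether the distances come from the Euclidean or the DTW algorithm.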
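Table 5. Post hoc test results for cluster analysis using the Euclidean and Dynamic Time Warping (DTW) algorithms.

| Variable | Row | Euclidean algorithm | DTW algorithm |
|---|---|---|---|
| pH | Cluster | 4, 2, 3, 1, 5 | 4, 2, 3, 1, 5 |
| pH | Mean | 8.251, 8.192, 8.017, 7.915, 7.826 | 8.255, 8.242, 8.174, 7.932, 7.829 |
| pH | Group | aabcc | aaabc |
| DO | Cluster | 2, 3, 4, 1, 5 | 2, 4, 3, 1, 5 |
| DO | Mean | 11.798, 11.718, 11.603, 10.863, 10.599 | 12.047, 11.613, 11.598, 11.076, 10.596 |
| DO | Group | aaabb | aaabbcc |
| BOD | Cluster | 5, 1, 2, 4, 3 | 5, 3, 2, 1, 4 |
| BOD | Mean | 2.466, 0.995, 0.838, 0.652, 0.437 | 2.420, 1.313, 0.723, 0.752, 0.647 |
| BOD | Group | abcde | abccdd |
| COD | Cluster | 5, 1, 2, 4, 3 | 5, 3, 1, 4, 2 |
| COD | Mean | 6.229, 3.274, 3.016, 2.883, 2.645 | 6.200, 3.157, 3.116, 2.874, 2.810 |
| COD | Group | abccd | abbcc |
| SS | Cluster | 5, 2, 1, 4, 3 | 5, 2, 3, 1, 4 |
| SS | Mean | 12.753, 4.619, 3.744, 2.973, 1.180 | 12.393, 5.017, 4.055, 3.016, 3.004 |
| SS | Group | abbcbcc | abbbb |
| TN | Cluster | 5, 4, 2, 1, 3 | 5, 2, 4, 3, 1 |
| TN | Mean | 8.896, 2.348, 2.204, 1.838, 1.432 | 8.955, 2.363, 2.348, 1.890, 1.708 |
| TN | Group | abbcd | abbcc |
| TP | Cluster | 5, 1, 2, 4, 3 | 5, 2, 4, 3, 1 |
| TP | Mean | 0.104, 0.019, 0.019, 0.018, 0.011 | 0.102, 0.021, 0.018, 0.017, 0.016 |
| TP | Group | abbbcc | abbbb |
| Temp | Cluster | 5, 2, 4, 1, 3 | 5, 2, 4, 3, 1 |
| Temp | Mean | 16.725, 14.945, 13.241, 13.097, 11.006 | 16.691, 15.263, 13.205, 12.887, 12.735 |
| Temp | Group | aabbc | aabbb |
| EC | Cluster | 5, 4, 2, 1, 3 | 5, 4, 2, 1, 3 |
| EC | Mean | 418.594, 156.011, 137.684, 132.976, 96.678 | 418.398, 155.536, 150.155, 123.336, 101.827 |
| EC | Group | abccd | abbcd |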
Table 6. Cluster characteristics for the Euclidean and Dynamic Time Warping (DTW) algorithms.

| Algorithm | Cluster | Station | Variable characteristics | Location characteristics |
|---|---|---|---|---|
| Euclidean | 1 | CA(MS2), UD(MS3), SB(MS5), PD(MS6) | low: pH, DO | Mainstream |
| Euclidean | 2 | GP(RT1), JJ(RT2), CP(MS4) | high: pH, DO, Temp | Right tributary; Mainstream; Midstream |
| Euclidean | 3 | SY(LT1), HC(MS1) | high: DO; low: BOD, COD, SS, TN, TP, Temp, EC | Left tributary; Mainstream; Upstream |
| Euclidean | 4 | BG(LT3), HG(LT2) | high: pH, DO | Left downstream tributary |
| Euclidean | 5 | MH(RT3) | high: BOD, COD, SS, TN, TP, Temp, EC; low: pH, DO | Right downstream tributary |
| DTW | 1 | HC(MS1), CA(MS2), CP(MS4), SB(MS5), PD(MS6), SY(LT1) | low: TN, Temp | Left tributary; Mainstream |
| DTW | 2 | GP(RT1), JJ(RT2) | high: pH, DO, Temp; low: COD | Right midstream tributary |
| DTW | 3 | UD(MS3) | high: pH; low: TN, Temp, EC | Mainstream |
| DTW | 4 | HG(LT2), BG(LT3) | high: pH, DO; low: BOD, COD, Temp | Left downstream tributary |
| DTW | 5 | MH(RT3) | high: BOD, COD, SS, TN, TP, Temp, EC; low: pH, DO | Right downstream tributary |

Share and Cite

MDPI and ACS Style

Lee, S.; Kim, J.; Hwang, J.; Lee, E.; Lee, K.-J.; Oh, J.; Park, J.; Heo, T.-Y. Clustering of Time Series Water Quality Data Using Dynamic Time Warping: A Case Study from the Bukhan River Water Quality Monitoring Network. Water 2020, 12, 2411. https://doi.org/10.3390/w12092411

AMA Style

Lee S, Kim J, Hwang J, Lee E, Lee K-J, Oh J, Park J, Heo T-Y. Clustering of Time Series Water Quality Data Using Dynamic Time Warping: A Case Study from the Bukhan River Water Quality Monitoring Network. Water. 2020; 12(9):2411. https://doi.org/10.3390/w12092411

Chicago/Turabian Style

Lee, Seulbi, Jaehoon Kim, Jongyeon Hwang, EunJi Lee, Kyoung-Jin Lee, Jeongkyu Oh, Jungsu Park, and Tae-Young Heo. 2020. "Clustering of Time Series Water Quality Data Using Dynamic Time Warping: A Case Study from the Bukhan River Water Quality Monitoring Network" Water 12, no. 9: 2411. https://doi.org/10.3390/w12092411

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
