# An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China

*Int. J. Environ. Res. Public Health* **2015**, *12*(11), 14400-14413; https://doi.org/10.3390/ijerph121114400

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Dataset

^{2}. The main stream runs through Hebei Province, Beijing City, Tianjin City and Shandong Province. The location of the river in China and the locations of the monitoring stations are illustrated in Figure 1. The dataset was obtained from the Ministry of Environmental Protection of China. It covers seven water quality monitoring stations on the Haihe River (Yanhecheng, Gubeikou, Gangnanshuiku, Guoheqiao, Sanchakou, Bahaoqiao and Chenggouwan) and comprises four water quality indicators monitored weekly over eight years (2006–2013). After eliminating unreasonable data and data worse than grade V, 2078 samples remained. Samples in which any indicator exceeded the grade V standard (i.e., grade VI) were excluded from the analysis because most such data lay far from the grade boundaries, could be regarded as outliers from a statistical point of view, and would degrade cluster quality. The available water quality indicators were pH, dissolved oxygen (DO), chemical oxygen demand (COD) and ammonia nitrogen (NH_{3}-N). The surface water environmental quality standards (GB3838-2002) for DO, COD and NH_{3}-N are listed in Table 1. The boundary values of DO, COD and NH_{3}-N defined in Table 1, together with the sample mean of pH, were used as the initial K cluster centroids. The descriptive statistics are summarized in Table 2. GB3838-2002 has five grades, not counting grade VI.

Indicator | I | II | III | IV | V |
---|---|---|---|---|---|
DO (mg/L) | 7.5 | 6 | 5 | 3 | 2 |
COD (mg/L) | 2 | 4 | 6 | 10 | 15 |
NH_{3}-N (mg/L) | 0.15 | 0.5 | 1 | 1.5 | 2 |
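
The grade boundaries in the table above can be applied indicator by indicator. The following Python sketch is illustrative only: the names `indicator_grade` and `worse_than_grade_v` are hypothetical, and it assumes that a value belongs to the best grade whose limit it satisfies (DO at or above the limit, COD and NH_{3}-N at or below it). This reading of the boundaries is an assumption, since the section does not spell out the comparison rule.

```python
# Grade limits copied from the table above (GB3838-2002), grades I to V.
LIMITS = {
    "DO": [7.5, 6, 5, 3, 2],          # lower bounds: higher DO is better
    "COD": [2, 4, 6, 10, 15],         # upper bounds: lower COD is better
    "NH3-N": [0.15, 0.5, 1, 1.5, 2],  # upper bounds: lower NH3-N is better
}
GRADES = ["I", "II", "III", "IV", "V"]


def indicator_grade(indicator, value):
    """Return the grade of one indicator value, or 'VI' if worse than grade V.

    Assumption: a value gets the best grade whose limit it satisfies.
    """
    for grade, limit in zip(GRADES, LIMITS[indicator]):
        satisfied = value >= limit if indicator == "DO" else value <= limit
        if satisfied:
            return grade
    return "VI"


def worse_than_grade_v(sample):
    """True if any indicator of the sample falls outside the grade V limit."""
    return any(indicator_grade(name, sample[name]) == "VI" for name in LIMITS)


# Hypothetical sample, not taken from the dataset.
sample = {"DO": 8.0, "COD": 16.0, "NH3-N": 0.3}
print(worse_than_grade_v(sample))
```

Under this rule, a filter such as `worse_than_grade_v` would drop the hypothetical sample because its COD value lies above the grade V limit of 15 mg/L.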

Indicator | Mean | SD | SE | Minimum | Maximum |
---|---|---|---|---|---|
pH | 8.07 | 0.43 | 0.01 | 6.34 | 9.35 |
DO (mg/L) | 9.02 | 2.83 | 0.06 | 2.02 | 25.5 |
COD (mg/L) | 3.51 | 2.40 | 0.05 | 0.2 | 15 |
NH_{3}-N (mg/L) | 0.40 | 0.44 | 0.01 | 0.01 | 2 |
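
Section 2.1 states that the grade boundary values of DO, COD and NH_{3}-N, together with the sample mean of pH, serve as the starting cluster centroids. A minimal Python sketch of that construction follows. The function name `initial_centroids` and the column order (pH, DO, COD, NH_{3}-N) are assumptions made for illustration, and the values are on the raw measurement scale.

```python
# Sample mean of pH from the descriptive statistics table.
PH_MEAN = 8.07

# Grade I to V boundary values from GB3838-2002 as listed in Section 2.1.
DO_LIMITS = [7.5, 6, 5, 3, 2]
COD_LIMITS = [2, 4, 6, 10, 15]
NH3N_LIMITS = [0.15, 0.5, 1, 1.5, 2]


def initial_centroids():
    """Build one starting centroid per grade, ordered (pH, DO, COD, NH3-N).

    Assumption: pH has no grade boundaries here, so every centroid
    shares the same pH coordinate (the sample mean).
    """
    return [
        (PH_MEAN, do, cod, nh3n)
        for do, cod, nh3n in zip(DO_LIMITS, COD_LIMITS, NH3N_LIMITS)
    ]


centroids = initial_centroids()
for grade, centroid in zip(["I", "II", "III", "IV", "V"], centroids):
    print(grade, centroid)
```

If the indicators are rescaled before clustering, these centroids would need the same rescaling so that they live in the same space as the samples.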

#### 2.2. Dataset Treatment

#### 2.3. Modified Indicator Weight Self-Adjustment K-Means Algorithm (MIWAS-K-Means)

## 3. Results and Discussion

#### 3.1. Evaluation Measures

Clustering Algorithms | K-Means | MIWAS-K-Means |
---|---|---|
SSE | 899.6053 | 782.2792 |
Number of iterations | 12 | 18 |
Final feature weights | (0.25, 0.25, 0.25, 0.25) | (0.1602, 0.1978, 0.5116, 0.1303) |
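
To make the SSE comparison concrete, the sketch below computes a sum of squared errors for a given cluster assignment and the relative reduction between the two reported values. It is an illustration under stated assumptions: the helper names `sse` and `relative_reduction` are hypothetical, and whether the reported SSE uses feature weights or rescaled data is not stated in this section, so `weights` is an optional argument.

```python
def sse(points, centroids, labels, weights=None):
    """Sum of (optionally weighted) squared distances to assigned centroids.

    points    : list of feature tuples
    centroids : list of centroid tuples
    labels    : labels[i] is the centroid index assigned to points[i]
    weights   : per-feature weights; defaults to 1 for every feature
    """
    if weights is None:
        weights = [1.0] * len(points[0])
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum(
            w * (x - c) ** 2 for w, x, c in zip(weights, point, centroid)
        )
    return total


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline


# SSE values reported in the table above.
print(f"{relative_reduction(899.6053, 782.2792):.2%}")  # prints 13.04%
```

On the reported figures, MIWAS-K-Means lowers the SSE by about 13% relative to K-means, at the cost of six additional iterations.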

#### 3.2. Weights of Features

Indicators | pH | DO | COD | NH_{3}-N |
---|---|---|---|---|
Weights | 0.1602 | 0.1978 | 0.5116 | 0.1303 |
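
One way to see what these weights do is to plug them into a weighted Euclidean distance. The sketch below is an assumption about the distance form (each squared difference is multiplied by its feature weight), which is a common choice in feature-weighted K-means but is not confirmed by this section. The name `weighted_distance` is hypothetical.

```python
import math

# Final feature weights from the table above.
WEIGHTS = {"pH": 0.1602, "DO": 0.1978, "COD": 0.5116, "NH3-N": 0.1303}


def weighted_distance(a, b, weights=WEIGHTS):
    """Weighted Euclidean distance between two samples given as dicts.

    Assumption: the weight multiplies the squared difference of each feature.
    """
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in weights.items()))


# Hypothetical samples on a common (e.g., rescaled) unit scale.
base = {"pH": 0.0, "DO": 0.0, "COD": 0.0, "NH3-N": 0.0}
shift_cod = dict(base, COD=1.0)
shift_ph = dict(base, pH=1.0)

print(weighted_distance(base, shift_cod) > weighted_distance(base, shift_ph))
```

With these weights, a unit difference in COD moves two samples further apart than a unit difference in any other indicator, which is consistent with COD carrying just over half of the total weight.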

#### 3.3. Water Quality Classification

NH_{3}-N and highest values of DO. From cluster 1 to cluster 3, the values of COD and NH_{3}-N increase steadily while DO tends to decrease. The mean COD in cluster 4 was higher than in cluster 5, while the mean NH_{3}-N in cluster 5 was higher than in cluster 4. It can be inferred that samples in cluster 4 were mainly influenced by COD emissions, while samples in cluster 5 were mainly influenced by NH_{3}-N emissions.

Indicator | Cl.1 | Cl.2 | Cl.3 | Cl.4 | Cl.5 |
---|---|---|---|---|---|
pH | 7.89 ± 0.39 | 8.12 ± 0.40 | 8.12 ± 0.44 | 8.3 ± 0.35 | 7.7 ± 0.43 |
DO (mg/L) | 9.43 ± 1.99 | 9.65 ± 2.40 | 8.75 ± 2.44 | 8.49 ± 2.99 | 3.97 ± 1.61 |
COD (mg/L) | 1.45 ± 0.33 | 2.38 ± 0.37 | 4.71 ± 0.11 | 8.95 ± 2.25 | 7.20 ± 2.46 |
NH_{3}-N (mg/L) | 0.17 ± 0.22 | 0.24 ± 0.28 | 0.54 ± 0.53 | 0.92 ± 0.85 | 1.37 ± 1.10 |
Number of cases | 502 | 700 | 545 | 194 | 137 |

#### 3.4. Verifying of Classification Accuracy

Cluster | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
1 | 97.4 | 2.6 | 0 | 0 | 0 |
2 | 3.4 | 94.6 | 0 | 2 | 0 |
3 | 0 | 0 | 96.9 | 2.1 | 1 |
4 | 0 | 5 | 1.1 | 92.8 | 1.1 |
5 | 0 | 0 | 2.9 | 3.6 | 93.4 |
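
The table above can be summarized with a short Python sketch. It assumes that each row is one cluster and that the entries are row percentages, with the diagonal giving the share of samples assigned back to their own cluster. That reading is an assumption, since the accompanying text is not available in this section. The helper names are hypothetical.

```python
# Percentages copied from the table above; rows and columns are clusters 1 to 5.
CONFUSION = [
    [97.4, 2.6, 0.0, 0.0, 0.0],
    [3.4, 94.6, 0.0, 2.0, 0.0],
    [0.0, 0.0, 96.9, 2.1, 1.0],
    [0.0, 5.0, 1.1, 92.8, 1.1],
    [0.0, 0.0, 2.9, 3.6, 93.4],
]


def diagonal(matrix):
    """Per-cluster agreement rates (the diagonal entries)."""
    return [matrix[i][i] for i in range(len(matrix))]


def mean_diagonal(matrix):
    """Unweighted mean of the per-cluster agreement rates."""
    rates = diagonal(matrix)
    return sum(rates) / len(rates)


def row_sums(matrix):
    """Row totals, rounded to one decimal to absorb float noise."""
    return [round(sum(row), 1) for row in matrix]


print(round(mean_diagonal(CONFUSION), 2))
print(row_sums(CONFUSION))
```

The unweighted mean of the diagonal is about 95%, and every row sums to 100% up to rounding, which supports reading the entries as row percentages.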

#### 3.5. Analysis of the Pollution Sources

NH_{3}-N are relatively low at Yanhecheng, Gubeikou, Gangnanshuiku and Guoheqiao. The mean DO is relatively lower and the values of COD and NH_{3}-N are relatively higher at Sanchakou and Bahaoqiao. The mean DO is lowest and the values of COD and NH_{3}-N are highest at Chenggouwan.

Station | pH | DO (mg/L) | COD (mg/L) | NH_{3}-N (mg/L) |
---|---|---|---|---|
Yanhecheng | 8.22 ± 0.47 | 8.97 ± 1.94 | 3.31 ± 1.38 | 0.24 ± 0.18 |
Gubeikou | 7.91 ± 0.41 | 8.64 ± 1.79 | 2.10 ± 1.09 | 0.19 ± 0.10 |
Gangnanshuiku | 7.91 ± 0.29 | 9.74 ± 1.45 | 1.75 ± 0.30 | 0.07 ± 0.04 |
Guoheqiao | 8.18 ± 0.39 | 10.3 ± 3.04 | 2.56 ± 0.81 | 0.31 ± 0.17 |
Sanchakou | 8.19 ± 0.45 | 8.66 ± 3.7 | 6.95 ± 2.65 | 0.83 ± 0.70 |
Bahaoqiao | 7.88 ± 0.40 | 7.30 ± 1.81 | 4.54 ± 1.39 | 1.11 ± 0.79 |
Chenggouwan | 8.15 ± 0.46 | 4.22 ± 3.33 | 10.1 ± 3.38 | 1.92 ± 1.52 |

Station | Sum | Cl.1 | Cl.2 | Cl.3 | Cl.4 | Cl.5 |
---|---|---|---|---|---|---|
Yanhecheng | 353 | 32 | 150 | 155 | 14 | 2 |
Gubeikou | 372 | 160 | 166 | 42 | 2 | 2 |
Gangnanshuiku | 354 | 238 | 116 | 0 | 0 | 0 |
Guoheqiao | 392 | 68 | 240 | 82 | 1 | 1 |
Sanchakou | 326 | 0 | 9 | 102 | 143 | 72 |
Bahaoqiao | 238 | 4 | 19 | 163 | 23 | 29 |
Chenggouwan | 43 | 0 | 0 | 1 | 11 | 31 |
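
The station-by-cluster counts above lend themselves to a quick consistency check and a per-station summary. The Python sketch below is illustrative, with hypothetical helper names. It recomputes the cluster totals, derives the share of each station's samples in each cluster, and picks the cluster holding most of a station's samples.

```python
# Sample counts per station for clusters 1 to 5, copied from the table above.
COUNTS = {
    "Yanhecheng": [32, 150, 155, 14, 2],
    "Gubeikou": [160, 166, 42, 2, 2],
    "Gangnanshuiku": [238, 116, 0, 0, 0],
    "Guoheqiao": [68, 240, 82, 1, 1],
    "Sanchakou": [0, 9, 102, 143, 72],
    "Bahaoqiao": [4, 19, 163, 23, 29],
    "Chenggouwan": [0, 0, 1, 11, 31],
}


def column_totals(counts):
    """Total number of samples in each cluster across all stations."""
    return [sum(column) for column in zip(*counts.values())]


def cluster_shares(counts):
    """Fraction of each station's samples that falls in each cluster."""
    return {
        station: [n / sum(row) for n in row]
        for station, row in counts.items()
    }


def dominant_cluster(counts):
    """Cluster number (1 to 5) holding the most samples at each station."""
    return {
        station: row.index(max(row)) + 1 for station, row in counts.items()
    }


print(column_totals(COUNTS))
for station, cluster in dominant_cluster(COUNTS).items():
    share = cluster_shares(COUNTS)[station][cluster - 1]
    print(f"{station}: cluster {cluster} ({share:.1%})")
```

The column totals reproduce the cluster sizes reported in Section 3.3 and add up to the 2078 samples of Section 2.1. The dominant cluster per station gives a compact view of where the more polluted clusters 4 and 5 concentrate.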

## 4. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest


© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zou, H.; Zou, Z.; Wang, X. An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China. *Int. J. Environ. Res. Public Health* **2015**, *12*, 14400-14413.
https://doi.org/10.3390/ijerph121114400
