# Efficient Water Quality Prediction Using Supervised Machine Learning

## Abstract

## 1. Introduction

- The available data were first cleaned and normalized, and feature selection was performed on the water quality measures to obtain the minimal relevant subset that allows high precision at low cost. In this way, expensive and cumbersome lab analysis with specific sensors can be avoided in further similar analyses.
- A series of representative supervised prediction (classification and regression) algorithms were tested on this dataset. The complete methodology is proposed in the context of water quality numerical analysis.
- The results show that gradient boosting and polynomial regression predict the water quality index (WQI) best, with a mean absolute error (MAE) of 1.9642 and 2.7273, respectively, whereas the multi-layer perceptron (MLP) classifies the water quality class (WQC) best, with an accuracy of 0.8507.

## 2. Literature Review

## 3. Data Preprocessing

#### 3.1. Data Collection

#### 3.2. Boxplot Analysis and Outlier Detection
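
As a minimal sketch of boxplot-based outlier detection, the snippet below flags values outside the usual whisker limits (Tukey fences at 1.5 times the interquartile range). The function name `iqr_outliers`, the factor `k=1.5` and the sample readings are illustrative assumptions, not the thresholds or data used in this study.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the boxplot whiskers (Tukey fences)."""
    # Quartiles with linear interpolation between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical turbidity readings in NTU; the last value is an obvious spike.
turbidity = [2.1, 2.4, 2.2, 2.8, 3.0, 2.5, 2.3, 19.0]
outliers = iqr_outliers(turbidity)
print(outliers)
```

Whether a flagged value is removed, capped or kept is a separate decision that depends on the parameter and on domain knowledge.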

#### 3.3. Water Quality Index (WQI)

#### 3.4. Water Quality Class (WQC)

#### 3.5. Q-Value Normalization

#### 3.6. Z-Score Normalization
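
Z-score normalization rescales a parameter to zero mean and unit standard deviation, z = (x − mean) / std. The following is a small sketch of that transformation. The use of the population standard deviation and the sample pH values are assumptions made for illustration.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical pH readings.
ph = [6.8, 7.2, 7.5, 8.1, 7.9]
ph_z = z_scores(ph)
print([round(z, 3) for z in ph_z])
```

When the data are split for training and testing, the mean and standard deviation would normally be estimated on the training portion only and then reused for the test portion.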

#### 3.7. Data Analysis

#### 3.7.1. Correlation Analysis

- Alkalinity (Alk) is highly correlated with hardness (CaCO_{3}) and calcium (Ca).
- Hardness is highly correlated with alkalinity and calcium, and loosely correlated with pH.
- Conductance is highly correlated with total dissolved solids, chlorides and fecal coliform count, and loosely correlated with calcium and temperature.
- Calcium is highly correlated with alkalinity and hardness, while loosely correlated with TDS, chlorides, conductance and pH.
- TDS is highly correlated with conductance, chlorides and fecal coliform, and loosely correlated with calcium and temperature.
- Chlorides are highly correlated with conductance and TDS, and loosely correlated with temperature, calcium and fecal coliform.
- Fecal coliform is correlated with conductance and TDS, and loosely correlated with chlorides.
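
The pairwise relationships listed above are Pearson correlation coefficients. A minimal sketch of how such a correlation matrix could be computed is shown below. The helper names and the sample values are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical measurements for three parameters.
samples = {
    "Cond": [410.0, 520.0, 630.0, 480.0, 700.0],
    "TDS": [262.0, 335.0, 401.0, 310.0, 452.0],
    "pH": [7.4, 7.1, 7.6, 7.9, 7.2],
}
corr = correlation_matrix(samples)
print(round(corr["Cond"]["TDS"], 3))
```

In this toy data, TDS tracks conductance almost linearly, which mirrors the strong conductance and TDS relationship reported in the list.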

The WQI is most strongly correlated with seven parameters: temperature, turbidity, pH, hardness as CaCO_{3}, conductance, total dissolved solids and fecal coliform count. To lower the cost of the system, we have to choose the minimal number of parameters needed to predict the WQI. Temperature, turbidity and pH are natural choices: their sensors are easily available and the cheapest, and each contributes distinctly to the WQI. Another convenient parameter is total dissolved solids (TDS), whose sensor is also easily available. Because TDS is correlated with conductance and fecal coliform count, selecting it allows us to discard those two parameters. We leave out the remaining parameter, hardness as CaCO_{3}, because it is comparatively weakly correlated and is not easy to acquire.
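
The selection reasoning in the paragraph above can be mimicked with a simple greedy filter: walk through the candidates in order of convenience, skip anything that is hard to acquire, and skip anything that is already well represented by a selected parameter. This is only an illustrative sketch. The coefficients are copied from the correlation table in this article, while the candidate ordering, the 0.45 threshold and the `hard_to_acquire` set are assumptions chosen to reproduce the stated outcome.

```python
# Signed Pearson coefficients copied from the article's correlation table.
CORR = {
    ("Temp", "Turb"): 0.103, ("Temp", "pH"): 0.005, ("Temp", "TDS"): 0.274,
    ("Temp", "Cond"): 0.266, ("Temp", "FC"): 0.194,
    ("Turb", "pH"): -0.0886, ("Turb", "TDS"): 0.042,
    ("Turb", "Cond"): 0.048, ("Turb", "FC"): 0.037,
    ("pH", "TDS"): -0.060, ("pH", "Cond"): -0.065, ("pH", "FC"): 0.054,
    ("Cond", "TDS"): 0.973, ("Cond", "FC"): 0.456, ("TDS", "FC"): 0.454,
}


def r(a, b):
    """Look up the coefficient of a pair regardless of its order."""
    return CORR[(a, b)] if (a, b) in CORR else CORR[(b, a)]


def select_features(candidates, threshold=0.45, hard_to_acquire=frozenset({"CaCO3"})):
    """Greedily keep candidates that are obtainable and not redundant."""
    selected = []
    for name in candidates:
        if name in hard_to_acquire:
            continue  # assumed to be costly or impractical to measure
        if any(abs(r(name, kept)) >= threshold for kept in selected):
            continue  # already represented by a selected parameter
        selected.append(name)
    return selected


# Candidates ordered from most to least convenient (an assumption).
candidates = ["Temp", "Turb", "pH", "TDS", "Cond", "FC", "CaCO3"]
selected = select_features(candidates)
print(selected)
```

With these assumptions the filter keeps temperature, turbidity, pH and TDS, and drops conductance and fecal coliform count as redundant with TDS.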

#### 3.7.2. Data Splitting–Cross Validation

#### 3.7.3. Machine Learning Algorithms

## 4. Results

#### 4.1. Accuracy Measures

#### 4.2. Results for Regression Algorithms

#### 4.3. Results for Classification Algorithms

## 5. Discussion

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. PCRWR. National Water Quality Monitoring Programme, Fifth Monitoring Report (2005–2006); Pakistan Council of Research in Water Resources Islamabad: Islamabad, Pakistan, 2007. Available online: http://www.pcrwr.gov.pk/Publications/Water%20Quality%20Reports/Water%20Quality%20Monitoring%20Report%202005-06.pdf (accessed on 23 August 2019).
2. Mehmood, S.; Ahmad, A.; Ahmed, A.; Khalid, N.; Javed, T. Drinking Water Quality in Capital City of Pakistan. Open Access Sci. Rep. **2013**, 2.
3. PCRWR. Water Quality of Filtration Plants, Monitoring Report; PCRWR: Islamabad, Pakistan, 2010. Available online: http://www.pcrwr.gov.pk/Publications/Water%20Quality%20Reports/FILTRTAION%20PLANTS%20REPOT-CDA.pdf (accessed on 23 August 2019).
4. Gazzaz, N.M.; Yusoff, M.K.; Aris, A.Z.; Juahir, H.; Ramli, M.F. Artificial neural network modeling of the water quality index for Kinta River (Malaysia) using water quality variables as predictors. Mar. Pollut. Bull. **2012**, 64, 2409–2420.
5. Daud, M.K.; Nafees, M.; Ali, S.; Rizwan, M.; Bajwa, R.A.; Shakoor, M.B.; Arshad, M.U.; Chatha, S.A.S.; Deeba, F.; Murad, W.; et al. Drinking water quality status and contamination in Pakistan. BioMed Res. Int. **2017**, 2017, 7908183.
6. Alamgir, A.; Khan, M.N.A.; Hany; Shaukat, S.S.; Mehmood, K.; Ahmed, A.; Ali, S.J.; Ahmed, S. Public health quality of drinking water supply in Orangi town, Karachi, Pakistan. Bull. Environ. Pharmacol. Life Sci. **2015**, 4, 88–94.
7. Shafi, U.; Mumtaz, R.; Anwar, H.; Qamar, A.M.; Khurshid, H. Surface Water Pollution Detection using Internet of Things. In Proceedings of the 2018 15th International Conference on Smart Cities: Improving Quality of Life Using ICT & IoT (HONET-ICT), Islamabad, Pakistan, 8–10 October 2018; pp. 92–96.
8. Ahmad, Z.; Rahim, N.; Bahadori, A.; Zhang, J. Improving water quality index prediction in Perak River basin Malaysia through a combination of multiple neural networks. Int. J. River Basin Manag. **2017**, 15, 79–87.
9. Sakizadeh, M. Artificial intelligence for the prediction of water quality index in groundwater systems. Model. Earth Syst. Environ. **2016**, 2, 8.
10. Abyaneh, H.Z. Evaluation of multivariate linear regression and artificial neural networks in prediction of water quality parameters. J. Environ. Health Sci. Eng. **2014**, 12, 40.
11. Ali, M.; Qamar, A.M. Data analysis, quality indexing and prediction of water quality for the management of rawal watershed in Pakistan. In Proceedings of the Eighth International Conference on Digital Information Management (ICDIM 2013), Islamabad, Pakistan, 10–12 September 2013; pp. 108–113.
12. Ranković, V.; Radulović, J.; Radojević, I.; Ostojić, A.; Čomić, L. Neural network modeling of dissolved oxygen in the Gruža reservoir, Serbia. Ecol. Model. **2010**, 221, 1239–1244.
13. Kangabam, R.D.; Bhoominathan, S.D.; Kanagaraj, S.; Govindaraju, M. Development of a water quality index (WQI) for the Loktak Lake in India. Appl. Water Sci. **2017**, 7, 2907–2918.
14. Thukral, A.; Bhardwaj, R.; Kaur, R. Water quality indices. Sat **2005**, 1, 99.
15. Srivastava, G.; Kumar, P. Water quality index with missing parameters. Int. J. Res. Eng. Technol. **2013**, 2, 609–614.
16. Jayalakshmi, T.; Santhakumaran, A. Statistical normalization and back propagation for classification. Int. J. Comput. Theory Eng. **2011**, 3, 1793–8201.
17. Amral, N.; Ozveren, C.; King, D. Short term load forecasting using multiple linear regression. In Proceedings of the 2007 42nd International Universities Power Engineering Conference, Brighton, UK, 4–6 September 2007; pp. 1192–1198.
18. Ostertagová, E. Modelling using polynomial regression. Procedia Eng. **2012**, 48, 500–506.
19. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News **2002**, 2, 18–22.
20. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. **2002**, 38, 367–378.
21. Tong, S.; Koller, D. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. **2001**, 2, 45–66.
22. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics **1970**, 12, 55–67.
23. Zhang, Y.; Duchi, J.; Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. **2015**, 16, 3299–3340.
24. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B **1996**, 58, 267–288.
25. Zou, H.; Hastie, T. Regression shrinkage and selection via the elastic net, with applications to microarrays. J. R. Stat. Soc. Ser. B **2003**, 67, 301–320.
26. Günther, F.; Fritsch, S. Neuralnet: Training of neural networks. R J. **2010**, 2, 30–38.
27. Zhang, H. The optimality of naive Bayes. AA **2004**, 1, 3.
28. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013.
29. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010, Paris, France, 22–27 August 2010; pp. 177–186.
30. Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. When is “nearest neighbor” meaningful? In Proceedings of the International Conference on Database Theory, Jerusalem, Israel, 10–12 January 1999; pp. 217–235.
31. Quinlan, J.R. Decision trees and decision-making. IEEE Trans. Syst. Man Cybern. **1990**, 20, 339–346.
32. Breiman, L. Bagging predictors. Mach. Learn. **1996**, 24, 123–140.
33. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. **2005**, 30, 79–82.
34. Menard, S. Coefficients of determination for multiple logistic regression analysis. Am. Stat. **2000**, 54, 17–24.
35. Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Hobart, Australia, 4–8 December 2006; pp. 1015–1021.
36. Goutte, C.; Gaussier, E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of the European Conference on Information Retrieval, Santiago de Compostela, Spain, 21–23 March 2005; pp. 345–359.

**Table 1.** Parameters along with their “WHO” standard limits [11].

Parameter | WHO Limits |
---|---|
Alkalinity | 500 mg/L |
Appearance | Clear |
Calcium | 200 mg/L |
Chlorides | 200 mg/L |
Conductance | 2000 µS/cm |
Fecal Coliforms | Nil Colonies/100 mL |
Hardness as CaCO_{3} | 500 mg/L |
Nitrite as NO_{2}^{−} | <1 mg/L |
pH | 6.5–8.5 |
Temperature | °C |
Total Dissolved Solids | 1000 mg/L |
Turbidity | 5 NTU |

Weighting Factor | Weight |
---|---|
pH | 0.11 |
Temperature | 0.10 |
Turbidity | 0.08 |
Total Dissolved Solids | 0.07 |
Nitrates | 0.10 |
Fecal Coliform | 0.16 |
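
The weights in this table can be combined with per-parameter quality scores (q-values) into a single index. The sketch below assumes the common weighted-average form, WQI = Σ(q_i · w_i) / Σ(w_i), summed over the parameters that are actually measured. That formula, the function name `wqi`, the dictionary keys and the example q-values are assumptions for illustration and are not confirmed by the text of this section.

```python
# Weights copied from the table above; "TDS" stands for the total dissolved row.
WEIGHTS = {
    "pH": 0.11,
    "Temperature": 0.10,
    "Turbidity": 0.08,
    "TDS": 0.07,
    "Nitrates": 0.10,
    "Fecal Coliform": 0.16,
}


def wqi(q_values, weights=WEIGHTS):
    """Weighted average of q-values (0-100) over the parameters provided."""
    used = [name for name in q_values if name in weights]
    if not used:
        raise ValueError("no weighted parameter supplied")
    total_weight = sum(weights[name] for name in used)
    # Dividing by the weights in use keeps the index on a 0-100 scale
    # even when some parameters are missing.
    return sum(q_values[name] * weights[name] for name in used) / total_weight


# Hypothetical q-values for a sample with four measured parameters.
sample_q = {"pH": 90, "Temperature": 80, "Turbidity": 70, "TDS": 85}
sample_wqi = wqi(sample_q)
print(round(sample_wqi, 2))
```

Because the listed weights do not sum to one, some normalization step of this kind is needed for the result to stay between 0 and 100.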

Water Quality Index Range | Class |
---|---|
0–25 | Very bad |
25–50 | Bad |
50–70 | Medium |
70–90 | Good |
90–100 | Excellent |
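
The mapping from a WQI value to its class is a simple range lookup, sketched below. The table lists each boundary in two adjacent ranges (for example, 25 appears in both 0–25 and 25–50), so this sketch assumes that a boundary value belongs to the higher class. That tie-breaking rule and the function name `wqc` are assumptions.

```python
import bisect

# Upper bounds of the first four ranges and the class labels from the table.
_UPPER_BOUNDS = [25, 50, 70, 90]
_CLASSES = ["Very bad", "Bad", "Medium", "Good", "Excellent"]


def wqc(wqi_value):
    """Return the water quality class for a WQI between 0 and 100."""
    if not 0 <= wqi_value <= 100:
        raise ValueError("WQI must lie between 0 and 100")
    # bisect_right sends a boundary value (e.g. 25) to the higher class.
    return _CLASSES[bisect.bisect_right(_UPPER_BOUNDS, wqi_value)]


print(wqc(24.9), wqc(25), wqc(81.8), wqc(100))
```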

| Temp | Turb | pH | Alk | CaCO_{3} | Cond | Ca | TDS | Cl | NO_{2} | FC | WQI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Temp | 1.000 | 0.103 | 0.005 | −0.193 | −0.288 | 0.266 | −0.150 | 0.274 | 0.293 | −0.154 | 0.194 | −0.467 |
Turb | 0.103 | 1.000 | −0.0886 | −0.093 | −0.146 | 0.048 | −0.122 | 0.042 | 0.037 | 0.0002 | 0.037 | −0.354 |
pH | 0.005 | −0.088 | 1.000 | −0.177 | −0.278 | −0.065 | −0.236 | −0.060 | −0.149 | 0.167 | 0.054 | −0.431 |
Alk | −0.193 | −0.092 | −0.177 | 1.000 | 0.462 | 0.011 | 0.444 | 0.012 | 0.061 | 0.046 | 0.013 | 0.223 |
CaCO_{3} | −0.288 | −0.146 | −0.278 | 0.462 | 1.000 | 0.068 | 0.637 | 0.060 | 0.135 | 0.078 | 0.016 | 0.360 |
Cond | 0.266 | 0.048 | −0.064 | 0.011 | 0.068 | 1.000 | 0.225 | 0.973 | 0.780 | 0.100 | 0.456 | −0.370 |
Ca | −0.150 | −0.122 | −0.236 | 0.444 | 0.637 | 0.225 | 1.000 | 0.219 | 0.262 | 0.124 | 0.113 | 0.188 |
TDS | 0.273 | 0.041 | −0.060 | 0.012 | 0.060 | 0.974 | 0.219 | 1.000 | 0.765 | 0.095 | 0.454 | −0.381 |
Cl | 0.292 | 0.037 | −0.149 | 0.061 | 0.135 | 0.780 | 0.262 | 0.765 | 1.000 | 0.036 | 0.353 | −0.274 |
NO_{2} | −0.154 | 0.0002 | 0.167 | 0.046 | 0.078 | 0.100 | 0.124 | 0.095 | 0.036 | 1.000 | 0.193 | −0.209 |
FC | 0.194 | 0.037 | 0.053 | 0.012 | 0.016 | 0.456 | 0.113 | 0.454 | 0.353 | 0.193 | 1.000 | −0.421 |
WQI | −0.467 | −0.354 | −0.431 | 0.223 | 0.360 | −0.370 | 0.188 | −0.381 | −0.274 | −0.209 | −0.421 | 1.000 |

Algorithm | MAE | MSE | RMSE | R Squared |
---|---|---|---|---|
Linear Regression | 2.6312 | 11.7550 | 3.4286 | 0.6573 |
Polynomial Regression | 2.0037 | 7.9467 | 2.8190 | 0.7134 |
Random Forest | 2.3053 | 9.5669 | 3.0930 | 0.6705 |
Gradient Boosting | 1.9642 | 7.2011 | 2.6835 | 0.7485 |
SVM | 2.4373 | 10.6333 | 3.2609 | 0.3458 |
Ridge Regression | 2.6323 | 11.7500 | 3.4278 | 0.4971 |
Lasso Regression | 3.5850 | 20.1185 | 4.4854 | −2.9327 |
Elastic Net Regression | 3.6595 | 20.9698 | 4.5793 | −4.0050 |

Algorithm | MAE | MSE | RMSE | R Squared |
---|---|---|---|---|
Linear Regression | 3.1375 | 15.8321 | 3.9790 | 0.5384 |
Polynomial Regression | 2.7273 | 12.7307 | 3.5680 | 0.4851 |
Random Forest | 3.0404 | 15.2473 | 3.9048 | 0.4107 |
Gradient Boosting | 2.8060 | 13.2710 | 3.6429 | 0.5051 |
SVM | 2.8252 | 13.8546 | 3.7222 | 0.1546 |
Ridge Regression | 3.1386 | 15.8327 | 3.9790 | 0.2031 |
Lasso Regression | 3.8800 | 22.9966 | 4.7955 | −3.6636 |
Elastic Net Regression | 3.9697 | 24.0678 | 4.9059 | −5.5210 |
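
The four metric columns in the regression tables follow standard definitions, sketched here for reference. The function name and the toy values are illustrative assumptions. How the study aggregated the metrics across validation folds is not stated in this section, so the sketch covers a single set of predictions only.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R squared for one set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance
    ss_res = sum(e * e for e in errors)                 # unexplained variance
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical true and predicted WQI values.
y_true = [60.0, 70.0, 80.0, 90.0]
y_pred = [62.0, 69.0, 83.0, 88.0]
metrics = regression_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

Two properties help when reading the tables: RMSE is the square root of MSE (for example, √11.7550 ≈ 3.4286 in the first linear regression row), and R squared becomes negative whenever a model predicts worse than simply using the mean of the observed values, as in the lasso and elastic net rows.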

Algorithm | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|
MLP | 0.8507 | 0.5659 | 0.5640 | 0.5649 |
Gaussian Naïve Bayes | 0.7843 | 0.4964 | 0.5491 | 0.5025 |
Logistic Regression | 0.8401 | 0.5520 | 0.5594 | 0.5548 |
Stochastic Gradient Descent | 0.8205 | 0.5473 | 0.5424 | 0.5443 |
KNN | 0.7270 | 0.4734 | 0.4783 | 0.4750 |
Decision Tree | 0.7949 | 0.5298 | 0.5250 | 0.5268 |
Random Forest | 0.7587 | 0.5063 | 0.5011 | 0.5027 |
SVM | 0.7979 | 0.5187 | 0.5327 | 0.5228 |
Gradient Boosting Classifier | 0.8130 | 0.5375 | 0.5376 | 0.5376 |
Bagging Classifier | 0.8100 | 0.5410 | 0.5354 | 0.5374 |
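
For a multi-class target such as the WQC, precision, recall and F1 have to be averaged over the classes. The sketch below uses macro-averaging (an unweighted mean of the per-class scores). Macro-averaging is an assumption here, since this section does not state which averaging the study used, and the labels in the example are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    k = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical true and predicted classes for six samples.
y_true = ["Good", "Good", "Medium", "Medium", "Bad", "Good"]
y_pred = ["Good", "Medium", "Medium", "Medium", "Bad", "Good"]
scores = classification_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

With per-class averaging, the reported F1 need not equal the harmonic mean of the reported precision and recall. Accuracy can also be much higher than the macro scores when rare classes are predicted poorly, which is one possible reading of the gap between the accuracy and F1 columns in the table.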

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ahmed, U.; Mumtaz, R.; Anwar, H.; Shah, A.A.; Irfan, R.; García-Nieto, J.
Efficient Water Quality Prediction Using Supervised Machine Learning. *Water* **2019**, *11*, 2210.
https://doi.org/10.3390/w11112210
