Identifying the Most Discriminative Parameter for Water Quality Prediction Using Machine Learning Algorithms

Abstract: Groundwater quality is a major concern because it directly affects human health and the growth of plants and vegetables. Given the severe impacts of inadequate water quality, a swift and economical solution is imperative. Water quality prediction can help manage water resources properly. The present study considers thirty-seven water samples from the Pindrawan tank command area of Raipur district, Chhattisgarh, India. A total of nineteen physicochemical parameters were measured, of which seventeen were used to compute the weight-based groundwater quality index (WQI). The primary goal of this work is to identify the most effective parameters for WQI prediction. Of the seventeen parameters tested, the Mann-Whitney-Wilcoxon (MWW) statistical test revealed that five parameters (Fe, Cr, Na, Ca, and Mg) are statistically significant in distinguishing between drinkable and non-drinkable water. Of these five, Cr is the only parameter whose range of values differs between drinkable and non-drinkable water. To validate the efficiency of these statistically significant parameters, machine learning techniques such as Artificial Neural Networks (ANN) and Logistic Regression (LR) were used. The experimental results show that, of all seventeen parameters tested, using Cr alone yields high classification accuracy: 91.67% with the ANN, compared with 66.67% obtained using all seventeen parameters. The proposed methodology thus achieved good accuracy when classifying water samples into drinkable and non-drinkable water using only one parameter, Cr.


Introduction
Groundwater is an invaluable resource that benefits both humans and plants. A significant proportion of Indians continue to rely on groundwater for drinking. Agriculture, which involves both growing crops and raising animals, and forestry, which involves the careful management of forests and woodlands, rely entirely on groundwater resources. The quality of drinking water profoundly affects human health. As reported in [1], water-borne diseases cause about 2.5 billion illnesses and 5 million fatalities and account for 80% of infections in developing countries. Maintaining the quality of groundwater is therefore crucial for human as well as plant health [2]. As found in the literature [3][4][5], many researchers have conducted groundwater quality studies. These in-depth investigations primarily concentrated on a specific geographical location, facilitating precise predictions of water quality and efficient management of water resources in that particular area [6]. Testing the water quality for every region is tedious and costly. Most existing studies [7][8][9] have examined more than ten parameters to measure how they affect the Water Quality Index (WQI). The WQI plays a major role in this context. In recent decades, different water quality measurement techniques have been proposed and used. Most of these techniques use the weights of the parameters to calculate the WQI. The WQI is defined as a rating reflecting the composite influence of different water quality parameters [10]. It is calculated from the point of view of the suitability of surface water for human consumption [11]. To compute the WQI, the authors used the method of Yisa et al. (2010) [12]. In a review paper by Yan et al.
(2022) [13], multiple methods of computing the WQI are discussed. Water quality degradation can have various negative socio-economic impacts, such as soil erosion and changes in rainfall patterns. The consequences can be very dangerous [14]. Many countries therefore monitor water quality by tracking certain physicochemical parameters [15,16]. This study aims to highlight the significance of proposing and evaluating a highly effective ML-based approach for accurately predicting real-time water quality. Kord et al. (2022) [17] used eight parameters (Na, Ca, Mg, Cl, SO4, pH, TH, and TDS) to model the WQI. They modelled the WQI using neural networks, with precipitation and water-table fluctuation as inputs. In [18], the authors simulated and predicted pH, EC, TDS, TH, PLI, MHMI, and SPI using multiple linear regression (MLR) and artificial neural network algorithms. In a case-study-based model to predict the groundwater quality index of Rafsanian Plain, the authors used electrical conductivity, total hardness, total dissolved solids, pH, chloride, bicarbonate, sulfate, phosphate, calcium, magnesium, potassium, and sodium as inputs [19]. In a novel prediction-based model [20], the authors analyzed the increase in WQI for nine wells and found that in five wells the WQI increased significantly due to the presence of Mn, NH4, and NO3. Singha et al.
(2021) [5] employed a machine-learning (ML) method to predict the quality of groundwater. Instead of the traditional method of calculating the Water Quality Index (WQI), the authors used the Entropy Weight-based Groundwater Quality Index (EWQI), which is based on thirteen physicochemical parameters. ML algorithms such as the Naive Bayes classifier (NBC) and support vector machine (SVM) can estimate and predict the WQI accurately from measured field parameters, which eliminates the need for extra time and effort [21]. Gupta et al. (2019) [22] introduced a cascade forward approach to predict groundwater quality; the method, based on an advanced ANN, delivers high predictability. Sakizadeh (2016) employed Bayesian regularization in an ANN model to forecast the WQI, achieving a prediction performance of 94% (R²), which demonstrates the effectiveness and reliability of the model [23]. ML models have also been utilized to predict the hardness of groundwater. The performance of two models, boosted regression trees (BRT) and random forest (RF), was compared using multivariate discriminant analysis (MDA). Using hardness values from 135 groundwater quality monitoring wells and 11 predictor variables, hard and soft water were accurately distinguished [24]. Recent advances in ML techniques provide valuable tools for producing an artificial groundwater recharge site suitability map (AGRSSM). An ML algorithm was developed to determine the ideal site for an agricultural groundwater recharge (AGR) project in Iranshahr. The algorithm used nine digitized and geo-referenced data layers and was trained and validated using 1000 randomly selected points from the study area. The
algorithm achieved an accuracy of 97% [25]. Almost all of these papers used more than six or seven parameters to assess groundwater quality. Collecting data and conducting laboratory tests can be very time-consuming and expensive. Additionally, the management of data often poses its own set of challenges [26]. Different ML techniques use the water body's location and elevation as inputs to predict contamination levels. The results are reviewed and analyzed according to groundwater contamination and the chemical composition of the groundwater location. Predictions for pH, temperature, turbidity, dissolved oxygen, hardness, chlorides, alkalinity, and chemical oxygen demand rely on unalterable parameters like latitude, longitude, and elevation [27]. Gaagai et al. (2023) studied groundwater quality for irrigation in the Sahara aquifer in Algeria. They used various methods, including irrigation water quality indices (IWQIs), ANN models, and Gradient Boosting Regression (GBR), together with multivariate statistical analysis and a geographic information system (GIS). They analysed 27 groundwater samples using traditional techniques. To enhance the IWQI, they applied two ML models, ANN and GBR, and their analysis showed that the ANN model outperformed the GBR model [28]. Furthermore, the process typically requires several years to yield conclusive results, and over this timeframe the impact on different regions may change [29]. In their study, Asadollah et al.
(2021) examined 10 different water quality parameters to assess the monthly water quality of the Lam Tsuen River in Hong Kong [30]. Researchers frequently attempt to minimize the number of parameters in order to reduce both costs and the size of the dataset [31]. Knowledge-driven and machine learning decision tree-based approaches, ANN, and deep learning have been used in different papers to assess water quality [32][33][34]. Many studies neglect to capitalize on available datasets when choosing parameters for their research, resulting in inflated costs for the entire project [35].
Based on the literature survey, predicting the quality of groundwater requires analyzing multiple parameters and compiling the data to make accurate predictions [7,10]. This process is often tedious, time-consuming, and manpower-intensive, and checking all the parameters requires a large amount of funding. Because contaminants vary in weight, it is not sufficient to rely on just one or two parameters. It would be more effective to identify the most influential parameter on groundwater contamination using statistical and ML techniques [36]. Previous studies in this area have not provided much insight.
This study aims to uncover the fundamental physicochemical parameters that have a significant impact on water quality, in order to address the issues mentioned above. The novelty of this work is as follows. Firstly, the authors address the problem of identifying the most impactful parameters for analyzing water quality. Furthermore, the authors examine the selected parameters to reveal the fundamental factor that differentiates drinkable water from non-drinkable water. Importantly, this was done using existing data instead of creating a new dataset. The Mann-Whitney-Wilcoxon test was used to find statistically significant discriminative parameters for water quality estimation. The authors identified five parameters with clear statistical significance for predicting water quality, out of a total of nineteen present in the dataset. Among these five discriminating parameters, Cr has a distinct range of values for drinkable and non-drinkable water. Moreover, the chosen discriminative features were assessed using machine learning techniques to determine their effectiveness in predicting water quality.
The remaining sections of the paper are structured as follows. Section 2 describes the experimental dataset and methodology. Section 3 presents and discusses the experimental results of the proposed system, including the results obtained with the different experimental setups. Finally, Section 4 concludes the current work and gives some notes on future enhancements.

Experimental Dataset
For the experiments, we considered a dataset comprising 37 samples collected from the Pindrawan tank command area of Raipur district, Chhattisgarh, India. Each sample is represented by 19 parameters: pH, Cond, TDS, Alk, Cl−, Hd, F, Fe, Cr, Na, K, CO3, HCO3, Cl, Ca, Mg, HNO3, Fluoride, and SO4. To label the samples as drinkable or non-drinkable water, we referred to the work published in [12], in which the authors proposed an approach to water quality classification based on the water quality index. Based on the WQI value, water can be divided into five categories: excellent (WQI < 50), good water (50 < WQI < 100), poor water (100 < WQI < 200), very poor water (200 < WQI < 300), and water unsuitable for drinking (WQI > 300). We calculated the WQI value of each sample from the parameters of our experimental dataset. However, due to the small size of the dataset, instead of categorizing the data into five classes, we categorized them into two classes: drinkable (WQI < 100) and non-drinkable (WQI > 100). To improve the accuracy of our predictions, it is important to maintain a balanced experimental dataset. For the experiments, the dataset was divided into a training set and a testing set at a ratio of 8:2 (29 and 8 samples for training and testing, respectively).
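The labelling rule and the 8:2 split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the random shuffle is an assumption (the text does not say how samples were assigned to each set), and because the quoted ranges use strict inequalities, assigning a boundary value such as WQI = 100 to the worse category is also an assumption.

```python
import random

def wqi_category(wqi: float) -> str:
    """Five-class rating based on the WQI ranges quoted from [12]."""
    if wqi < 50:
        return "excellent"
    if wqi < 100:
        return "good"
    if wqi < 200:
        return "poor"
    if wqi < 300:
        return "very poor"
    return "unsuitable for drinking"

def is_drinkable(wqi: float) -> bool:
    """Binary label used in this study: drinkable when WQI < 100."""
    return wqi < 100

def split_80_20(samples, seed=0):
    """Shuffle the samples and split them into training and testing sets (8:2)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)  # 37 samples -> 29 for training
    return shuffled[:n_train], shuffled[n_train:]

train, test = split_80_20(range(37))
print(len(train), len(test))  # 29 8
```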

Machine Learning Algorithms
As reported in the literature, authors have used ANN, multiple linear regression, RF, GBR, SVM, etc., in their research. Because ours is a classification task, regression models are not applicable. Moreover, RF requires a large amount of data to produce reliable results and may not perform well on small datasets [37]. Similarly, the SVM overfitted because the experimental dataset is very small. Considering this, along with ANN we evaluated logistic regression and K-nearest neighbour (KNN), which are suitable for a small experimental dataset [38,39], to assess the parameters used in water quality identification. This evaluation allowed us to determine the efficiency of these parameters and draw conclusions. A brief description of each of these methods is provided below.
(1) Artificial Neural Network: An ANN resembles how neuronal arrays work in biological learning and memory [40]. A neural network is a bio-inspired system made up of several neurons, which are single processing units. The neurons are coupled through connections with specific weights. As reported in the literature, ANN has provided better performance than other ML techniques [22]. In contrast to other ML techniques, ANN has also been found to be slightly more accurate with noisy data. With the development of technology and datasets, ANN proves effective in this context. An ANN architecture typically comprises an input layer, a hidden layer, and an output layer [40]. The connection between two layers is represented by biases and weights. If $X_i$ is the input at neuron i, $b_j$ is the bias of neuron j, and the weight of the connection from neuron i to neuron j is denoted by $W_{ij}$, then the activation at the jth neuron is given by the following equation:

$$a_j = f\left(\sum_{i} W_{ij} X_i + b_j\right)$$

where $f$ denotes the activation function.

(2) Logistic Regression: LR is a supervised ML algorithm. It is called logistic regression because its foundation is the logistic, or sigmoid, function. It is one of the most widely used algorithms for binary classification. LR uses a method known as maximum likelihood estimation to find the model equation as follows:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

where:
• $p$: The predicted probability of the positive class
• $X_j$: The jth predictor variable
• $\beta_j$: The coefficient estimate for the jth predictor variable

(3) K-nearest neighbour: KNN is one of the simplest and most widely used supervised ML algorithms. It is a non-parametric, lazy learning algorithm, meaning that it does not explicitly learn a model during the training phase. Instead, it memorizes the entire training dataset and makes predictions by measuring the similarity of new data points to the training samples. Various distance metrics, such as the Euclidean, Manhattan, and Minkowski distances, can be used to measure the similarity between the training samples and new data points.
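The three descriptions above can be made concrete with a few lines of code. The sketch below is illustrative only and uses the Python standard library; the function names and the default choice of tanh as the activation function are assumptions made for the example, not details taken from the study.

```python
import math

def neuron_activation(x, w, b, f=math.tanh):
    """Activation of one neuron: f(sum_i w[i] * x[i] + b)."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

def logistic_probability(x, beta0, beta):
    """Logistic regression: sigmoid of beta0 + sum_j beta[j] * x[j]."""
    z = beta0 + sum(bj * xj for bj, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

def knn_predict(train_x, train_y, query, k=3, p=2):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda s: minkowski(s[0], query, p))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Toy one-feature example with made-up values (0 = drinkable, 1 = non-drinkable).
print(knn_predict([[0], [1], [10], [11], [12]], [0, 0, 1, 1, 1], [9], k=3))  # 1
```

Note that KNN stores the training data and defers all computation to prediction time, which is what "lazy learning" refers to.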

Proposed Methodology
Our proposed method for predicting water quality begins with a pre-processing phase, followed by identification of the most crucial parameters that determine water quality. We then classify samples as drinkable or non-drinkable water. Finally, we evaluate the performance of the designed system. The flow of the proposed methodology is shown in Figure 1.

Pre-processing
In the experimental dataset, the 'F−' column contains some missing values, so the column was dropped. The 'CO3' column was also dropped because all of its entries have the same value, 0. After removing these columns, the data were normalized using the Standard Scaler (SS). The average values of the parameters for drinkable and non-drinkable water are shown in Table 1.
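A minimal sketch of these pre-processing steps is given below, assuming the data are held as a mapping from column name to a list of values with None marking a missing entry. The function name and the sample values are hypothetical. The scaling step subtracts the column mean and divides by the population standard deviation, which is the transformation that scikit-learn's StandardScaler applies; the standard library is used here only to keep the example self-contained.

```python
from statistics import fmean, pstdev

def preprocess(table):
    """Drop unusable columns and standardize the rest.

    `table` maps a column name to a list of values (None = missing).
    """
    cleaned = {}
    for name, values in table.items():
        if any(v is None for v in values):
            continue  # drop columns that contain missing values
        if len(set(values)) == 1:
            continue  # drop constant columns, which carry no information
        mean, std = fmean(values), pstdev(values)
        cleaned[name] = [(v - mean) / std for v in values]  # zero mean, unit variance
    return cleaned

# Made-up values for illustration only.
raw = {"pH": [7.0, 7.5, 8.0], "F": [0.4, None, 0.6], "CO3": [0, 0, 0]}
print(list(preprocess(raw)))  # ['pH']
```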

Parameter Selection
Parameter selection is essential for achieving high classification accuracy. As mentioned before, the experimental dataset contains a total of 17 parameters after removal of the 'F−' and 'CO3' parameters. As also mentioned earlier, the number of samples in each category of our experimental dataset is low (<38). Moreover, the samples of each parameter are independent and not normally distributed. Considering these properties, the MWW test [41], also referred to as the Wilcoxon rank-sum test, was used to identify the relevant parameters and discard the irrelevant ones. Parameters for which the MWW test found significant differences (p < 0.005) are considered the most discriminative parameters for water quality prediction. The significance level of each parameter is measured against the null hypothesis and tabulated in Table 1. As illustrated in Table 1, although the mean values of the parameters differ considerably between drinkable and non-drinkable water, only five parameters (Fe, Cr, Na, Ca, and Mg) are found to be statistically significant (p < 0.05) in discriminating drinkable water from non-drinkable water. The p-values of these parameters reach the significance threshold of p < 0.005 and are marked as significant in the fourth column of Table 1. The null hypothesis is H0: the median of the non-drinkable water is less than the median of the drinkable water. For most of the parameters, however, the MWW test rejected the null hypothesis in favour of the alternative hypothesis Ha: the median of the non-drinkable water is greater than the median of the drinkable water.
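To illustrate how such a one-sided MWW test operates, the sketch below computes the U statistic from pooled ranks and an approximate p-value. It is a simplified illustration, not the procedure used in the study: the function names and sample values are hypothetical, and the p-value uses the normal approximation without tie or continuity correction, whereas an exact p-value is usually preferable for small samples. In practice a library routine such as `scipy.stats.mannwhitneyu` with `alternative="greater"` would normally be used.

```python
import math

def rank_with_ties(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mww_greater(x, y):
    """One-sided test of Ha: values in x tend to be greater than values in y.

    Returns (U, p), where U is the statistic for x and p is the upper-tail
    probability under the normal approximation.
    """
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return u, p

# Made-up values: x plays the role of non-drinkable samples, y of drinkable ones.
u, p = mww_greater([5, 6, 7], [1, 2, 3])
print(u, p < 0.05)  # 9.0 True
```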
Furthermore, Figure 2 presents box plots showing the complete range of values of the significant parameters for both drinkable and non-drinkable water. Figure 2a illustrates that non-drinkable and drinkable water share a range of Ca values. For Fe, Ca, Mg, and Na, drinkable and non-drinkable water have similar ranges of values. These parameters are therefore not sufficient on their own to distinguish between drinkable and non-drinkable water, even though they are statistically significant. Only one parameter, Cr, has a range that differs between drinkable and non-drinkable water, as depicted in Figure 2b. We therefore assess the effectiveness of this single parameter in predicting water quality using different ML algorithms.
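The range comparison described above amounts to checking whether the value intervals of the two classes intersect. The following sketch shows one way to express that check. The function names and sample values are hypothetical, and using the raw minimum and maximum (rather than, for example, box-plot whiskers) is an assumption.

```python
def value_range(values):
    """Smallest and largest observed value."""
    return min(values), max(values)

def ranges_overlap(a, b):
    """True when the [min, max] intervals of the two groups intersect."""
    lo_a, hi_a = value_range(a)
    lo_b, hi_b = value_range(b)
    return lo_a <= hi_b and lo_b <= hi_a

# Made-up values for illustration only.
print(ranges_overlap([1.0, 2.0, 3.0], [2.5, 4.0]))    # True: shared range
print(ranges_overlap([0.01, 0.02], [0.05, 0.09]))     # False: distinct ranges
```

A parameter whose class ranges do not overlap can separate the two classes with a single threshold, which is why such a parameter is a strong candidate for a one-feature classifier.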

Model Building
As reported in earlier sections, the authors evaluated the efficiency of the statistically significant parameters using three ML algorithms: (1) ANN, (2) LR, and (3) KNN. For the experiments, a four-layer ANN comprising one input layer, two hidden layers, and one output layer was built. The number of neurons in the input layer varies with the experimental setup. After evaluating various numbers of neurons in the hidden layers, the highest accuracy was achieved with 6 neurons in both the first and second hidden layers. The ReLU activation function was used in the hidden layers and the softmax activation function in the output layer. For the LR model, various parameter settings were evaluated, and the highest classification accuracy was obtained with C = 1.0, penalty = 'l2', and tolerance = 0.0001. As with ANN and LR, KNN was evaluated for different combinations of k values and distance metrics. By trial and error, the highest KNN accuracy was obtained with k = 7 and the Minkowski distance metric.
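The ANN architecture described above can be sketched as a forward pass through two hidden layers of 6 ReLU neurons and a softmax output layer. This is an illustration of the layer structure only: the weights are random and untrained, the function names are hypothetical, and the use of two output neurons (one per class) is an assumption, since the text specifies the output activation but not the number of output neurons. Note that the code stores weights as `weights[j][i]`, linking input i to neuron j.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases):
    """One fully connected layer: weights[j][i] links input i to neuron j."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def init_network(n_inputs, hidden=(6, 6), n_classes=2, seed=0):
    """Random, untrained weights for layers n_inputs -> 6 -> 6 -> n_classes."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, n_classes]
    return [
        ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def forward(network, x):
    *hidden_layers, output_layer = network
    for weights, biases in hidden_layers:
        x = relu(dense(x, weights, biases))
    return softmax(dense(x, *output_layer))

# One input neuron corresponds to the single-parameter (Cr only) setup.
probs = forward(init_network(n_inputs=1), [0.3])
print(len(probs), round(sum(probs), 6))  # 2 1.0
```

Changing `n_inputs` to 5 or 17 gives the input layer for the other two experimental setups while leaving the hidden and output layers unchanged.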

Results and Discussion
To evaluate the proposed findings, the authors assess the classification performance of each ML model using three different setups: (a) all 17 raw parameters; (b) only the statistically significant parameters; and (c) only the most statistically significant parameter, Cr. Four of the most commonly used metrics, namely accuracy (ACC), recall (RC) or sensitivity (SN), specificity (SP), and precision (PR), were used to assess the performance of the classification algorithms. The mathematical formula for each of these metrics is provided below:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{SN} = \frac{TP}{TP + FN}, \quad \mathrm{SP} = \frac{TN}{TN + FP}, \quad \mathrm{PR} = \frac{TP}{TP + FP}$$

where TP is the number of non-drinkable water samples classified as non-drinkable, TN is the number of drinkable water samples classified as drinkable, FP is the number of drinkable water samples classified as non-drinkable, and FN is the number of non-drinkable water samples classified as drinkable.
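As a worked example, the four metrics can be computed directly from the confusion-matrix counts, with non-drinkable water as the positive class. The function name is hypothetical and the counts below are made up for illustration; they are not results from this study. The sketch does not guard against a zero denominator.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision from confusion counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PR": tp / (tp + fp),
    }

# Hypothetical counts for a test set of 8 samples.
m = classification_metrics(tp=4, tn=3, fp=1, fn=0)
print(m)  # {'ACC': 0.875, 'SN': 1.0, 'SP': 0.75, 'PR': 0.8}
```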

Performance of Different Sets of Parameters
This section assesses how well the Cr parameter predicts water quality and compares its performance with that of the other parameter sets. In the first experimental setup, all 17 parameters are fed to the classifiers. The values of each evaluation metric (ACC, SN, SP, and PR) are listed in Table 2. As shown in Table 2, the highest classification accuracy obtained with this setup is 66.67%, with the ANN. Like the ACC, the SN and SP are also poor for all the classifiers. In the second experimental setup, which included only the five statistically significant parameters, classification accuracy increased markedly: with the ANN model, the accuracy improved to 83.33%. The SN and SP also improved significantly. In the last experimental setup, the classification performance of the single statistically significant parameter Cr was evaluated, as shown in Table 2. The classification accuracy improved further, reaching 91.67% with the ANN model. The SP also showed noticeable improvement.

Performance Comparison of the Proposed Method with Existing Methods
Many researchers have applied ML algorithms to predict water quality. However, most of these studies have focused primarily on regression models, and the effectiveness of their methods has been assessed using metrics such as mean square error, R-squared, and mean absolute error [1]. Because the proposed method is a classification method, it is compared only with existing classification-based methods. The comparison is based on the size of the experimental dataset, the number of parameters, the ML methods, and the performance measures, and is summarized in Table 3. According to Table 3, the deep neural network (DNN) proposed by U achieved an accuracy of 93%, the highest recorded accuracy [42]. However, due to the small dataset, we are not able to evaluate the performance of a DNN in the current study. Despite its minimal architecture, our proposed work achieves an accuracy of 91.67%, which is comparable to that of the DNN [42]. Moreover, as illustrated in Table 3, most of the existing works used more than one parameter for water quality prediction. Unlike these methods, our proposed approach requires only one parameter, making it a practical solution for real-time water quality prediction from multiple sources. The proposed method is not only efficient but also allows for integration at minimal cost.

Conclusions
An ML-based system has been developed in this study. The study emphasizes identifying the most effective parameters for classifying water as either drinkable or non-drinkable. The efficacy of 19 different parameters was evaluated to identify the most discriminative parameters for water quality prediction. The efficiency of the discriminative parameters was evaluated using three widely used classifiers: ANN, LR, and KNN. The conclusions of this study are listed below:
(a) The MWW test reveals that, out of all 19 parameters, five parameters (Fe, Cr, Na, Ca, and Mg) are statistically significant in water quality prediction.
(b) Of these five statistically significant parameters, Cr is the most significant, as the range of Cr values for drinkable water differs from the range of Cr values for non-drinkable water.
(c) The experimental results show that, compared to LR and KNN, ANN is more efficient for water quality prediction. For different sets of parameters, the ANN always shows better results than both LR and KNN; most of the time the difference is at least 10%.
(d) The system that uses only the statistically significant features (Fe, Cr, Na, Ca, and Mg) achieves higher classification accuracy than the system that considers all parameters.
(e) The system built with the single parameter Cr is the most efficient, achieving a classification accuracy of 91.67% using ANN.
The study concludes that ML makes it possible to recognize the unique characteristics of a particular area and leverage them to forecast water quality accurately. This method significantly reduces the time and cost of determining the WQI. However, the small size of the experimental dataset is the only limitation of the proposed method. In the future, we will try to collect more real-time data to validate the efficiency of the proposed method.

Water 2024, 16, 481
Figure 2. Boxplots displaying the range of values of the statistically significant parameters (a) Ca, (b) Cr, (c) Fe, (d) Mg, and (e) Na in drinkable and non-drinkable water.

Table 1. The mean values of the parameters in drinkable and non-drinkable samples along with their p-values obtained with the MWW test.

Table 2. The classification performance of different numbers of parameters in water quality prediction.

Table 3. Performance comparison of the proposed method with existing methods.