Water Quality Index Classification Based on Machine Learning: A Case from the Langat River Basin Model

: Traditionally, water quality is evaluated using expensive laboratory and statistical proce-dures, making real-time monitoring ineffective. Poor water quality requires a more practical and cost-effective solution. Water pollution has been a severe issue, hurting water quality in recent years. Therefore, it is crucial to create a model that forecasts water quality to control water pollution and inform consumers in the event of the detection of poor water quality. For effective water quality management, it is essential to accurately estimate the water quality class. Motivated by these con-siderations, we utilize the benefits of machine learning methods to construct a model capable of predicting the water quality index and water quality class. This study aims to investigate the performance of machine learning models for multiclass classification in the Langat River Basin water quality assessment. Three machine learning models were developed using Artificial Neural Networks (ANN), Decision Trees (DT), and Support Vector Machines (SVM) to classify river water quality. Comparative performance analysis between the three models indicates that the SVM is the best model for predicting river water quality in this study. In addition, there is a statistically significant difference in performance between the SVM, DT, and ANN models at the 0.05 level of confidence. The use of the kernel function, the grid search method, and the multiclass classification technique used in this study significantly impacts the effectiveness of the SVM model. The findings bolster the idea that machine learning models, particularly SVM, can be used to forecast WQI with a high degree of accuracy, hence enhancing water quality management. Consequently, the model based on machine learning lowered the cost and complexity of calculating sub-indices of six water quality parameters and classifying water quality compared to the standard IKA-JAS formula. SS, AN, and pH, which are chosen by an expert panel using Opinion Poll WQI to assess the water quality class of rivers in Malaysia. Meanwhile, the four attributes, STA NO, RIVER, SMP-DAT, and WQI, represent the monitoring station information and water quality index score value.


Introduction
Water quality has been monitored by the Department of Environment (DOE) since 1978, primarily to set guidelines for detecting changes in water quality and identify sources of pollution. Current water quality monitoring in Malaysia is based on the IKA-JAS. IKA-JAS is used to measure the degree of pollution and classify water quality in accordance with the National Water Quality Standard and the kind of water used. River water quality in Malaysia is categorized into five classes based on IKA-JAS. IKA-JAS is used to measure the pollution level and water use suitability as outlined by the National Water Quality Standard (SKAN). IKA-JAS considers the six water quality parameters in the formula and its calculations to produce a score value. The parameters are dissolved oxygen, biochemical oxygen requirement, chemical oxygen requirement, ammonia nitrogen, suspended solids, and pH. These parameters are obtained from water samples that have been analyzed to determine water's physical, chemical, and biological properties [1]. The score value obtained will then be compared with the IKA-JAS water quality index range to determine the water quality class. Water quality classification using the (FS-RBFNN) and its application for water quality prediction based on neuronal activity and reciprocal information (MI). The experimental results show that FS-RBFNN can be used to design RBF structures that have fewer hidden neurons. Hence, the training time is also faster.
Machine learning algorithms have succeeded in the environmental domain because of their ability to model complex relationships between variables. Although there are numerous machine learning algorithms, researchers continue to face challenges, such as determining which machine learning techniques should be used or are most appropriate for a specific problem. Therefore, to fill the gap based on past studies, this study aims to propose an approach based on machine learning techniques: the multiclass classification model to classify water quality index. This methodology is expected to change the way the classification of water quality index and water quality are monitored. Water quality assessment has become an important issue in water resources management. Conventional water quality assessment methods using WQI can be time-consuming and expensive, especially for complex datasets with multiple water quality parameters. Machine learning techniques have the ability to cut down on computation time, costs, and errors in water quality classification, water quality parameter forecasting, and water quality index forecasting. Machine learning is also considered an alternate technique for calculating the water quality index, including several sub-indices [15]. In recent years, machine learning approaches have been extensively utilized for river water quality assessment, including the calculation of WQI. These techniques have proven to be effective modeling tools for complicated non-linear processes in water resource studies. Each machine learning method has advantages and disadvantages, and its behavior depends on the input factors of water quality in the various research regions. Hence, a proactive approach is desperately needed to address Malaysia's water quality classification issues. For that purpose, an effective prediction model based on machine learning can be implemented. Therefore, this study aims to evaluate the performance of three machine learning algorithms on water quality classification problems. A classification model was developed based on Artificial Neural Network (ANN), Decision Trees (DT), and Support Vector Machine (SVM) to predict the WQI of the Langat River Basin.

Study Area
The Langat River Basin is one of Malaysia's most important river water catchment basins, as shown in Figure 1. It is located in the state of Selangor. The Langat River Basin spans an area of approximately 2409 km 2 and is located between latitudes 2°40′152″ N and 31°60′152″ N, and longitudes 101°19′20″ E and 102°1′10″ E [23]. The Titiwangsa granite mountain range is located upstream of the Langat River Basin, where the Langat River originates at Gunang Nuang and flows approximately 190 km through the states of Selangor and Negeri Sembilan, as well as the Federal Territories of Putrajaya and Kuala Lumpur, before entering the Malacca Strait. The river's basin is drained by the Langat, Semenyih, and Labu rivers. Meanwhile, the Langat River Basin's upstream dams are the Langat Dam and the Semenyih Dam. The two dams, Semenyih and Langat, in the Langat Basin, might serve as drinking water reservoirs. Apart from these dams, there are several ex-mining ponds scattered across the basin, particularly near the Paya Indah Wetlands in Kuala Langat. The basin's geography is described as both hilly and flat, with an average elevation of 400-1440 m. The average elevation of the central basin is less than 200 m, followed by less than 100 m in the lower basin. The igneous rock underneath the Langat River Basin is mostly granite. As a result, the basin's geology is characterized as Hawthornden Schist and Kenny Hill Formation (sandstone and phyllite). The primary soil types in the hilly upstream, flat midstream, and downstream include Tanah Curam (steep land), Rengan-Jerangan (urban land), and Tanah Gambut (peat soils). During the period 2005-2016, the average annual rainfall in the Langat River Basin ranged from 2043.68 to 2832.40 mm.

Research Methodology
The methodology of this study is divided into five phases, as shown in Figure 2: (i) data collection and understanding of water quality data from the Langat River Basin is obtained from the Malaysia Department of Environment and databases for the development of the multiclass model in predicting water quality classes; (ii) data preparation involves preprocessing data for minimizing or eliminating inconsistencies of data; (iii) multiclass classification model development involves the development of multiclass machine learning models to classify the water quality; and (iv) model evaluation involves the evaluation of the water quality multiclass classification model.

Phase I: Data Collection and Understanding
The Malaysia Department of Environment provided water quality data for the Langat River Basin for this study. The parameters used to analyze and evaluate water quality can generally be divided into physical, chemical, and biological parameters. Physical parameters include color, temperature, taste, odor, turbidity, and water solids suspended in water. Chemical parameters include dissolved oxygen, acidity, pH, and alkalinity, biochemical oxygen requirements, chemical oxygen requirements, ammonia nitrogen, electrical conductivity, total solids, and other pollutants. In contrast, biological parameters include bacteria such as fecal coliforms and algae.
The Langat River Basin raw water quality dataset is a time-series dataset. This raw dataset has a total of 560 records and 46 attributes taken from fourteen monitoring stations recorded over five years, from January 2012 to December 2016. The information related to the 46 attributes is divided and described as follows. Table 1 shows a list of attributes that contain information related to the monitoring station where the raw water quality dataset was obtained. Table 1 also presents the attributes of the sub-index of parameters used in calculating the Water Quality Index-Department of Environment (WQI-JAS) formula for water quality assessment. The six water quality parameter attributes used in the WQI-JAS formula are: DO, BOD, COD, SS, AN, and pH, which are chosen by an expert panel using Opinion Poll WQI to assess the water quality class of rivers in Malaysia. Meanwhile, the four attributes, STA NO, RIVER, SMP-DAT, and WQI, represent the monitoring station information and water quality index score value.

Phase II: Data Preparation
Data preparation begins with data exploration using data preprocessing techniques to better comprehend the study's datasets. There are several data preprocessing techniques used in the study. The basis of data preprocessing is a descriptive statistical analysis that helps study the general characteristics of the data and identify the presence of noise or outliers. Next, data cleaning can eliminate noise and inconsistencies in the data. Data transformations such as normalization can help improve the accuracy and efficiency of machine learning algorithms [24]. At the same time, data visualization is very helpful in visually inspecting data using plot graphs.
The Langat River Basin water quality dataset used in this study has a dimensional space of 560 rows and 10 columns. There are seven rivers in the Langat River Basin: Langat; Semenyih; Lui; Pajam; Value Bar; Jijan; and Pumpkin Stems, which are the locations where monitoring stations are located, and river water samples are collected.
Identifying the data type for each attribute is important because data are obtained in various formats and types. Table 1 shows a list of data types for each attribute in the initial dataset that will be analyzed statistically and descriptively in the next process. Descriptive statistical analysis of the data can better describe the data and help to understand the main features of the data distribution. The characteristics of the dataset used in the study can be understood through measures of central tendency such as mean, median, and mode; and measures of data dispersion such as range, quartile, variance, and standard deviation. This descriptive statistical summary technique can be used to handle data for machine learning tasks such as identifying data properties and highlighting values identified as noise data, missing data, or outliers [25]. Table 2 shows a descriptive statistical analysis of six attributes from the initial dataset representing the water quality parameters used in calculating the WQI-JAS formula. The Missing Value column means no data value is recorded for the attribute in the observation. Missing data is a common phenomenon and can significantly impact the conclusions drawn from the data [26]. The Minimum and Maximum columns represent the minimum and maximum values in the study dataset, while the Mean (μ) columns are the average values of the attributes. The collection and storage of water quality data from in situ monitoring stations as well as those sent for analysis in the laboratory are highly vulnerable to noise data, missing data, as well as incomplete and inconsistent data. Low-quality data will result in lowquality data mining results as well. The data cleaning process cleans up data by filling in missing values, smoothing out noise data, identifying or removing external elements, and resolving inconsistent data problems.
Data recorded from laboratory work carried out by different individuals or teams can result in errors such as multiple copies, data redundancy, repetitive data, and inconsistencies. This recurring data was removed through a data cleansing process, and the current total of the initial dataset is 553 records.
Raw data is rarely cleaned and often has corrupted or missing values. Therefore, it is important to identify, mark, and handle missing data before implementing the machine learning model development phase [27]. This process is important to get the best model performance. A total of 5 records out of 553 initial dataset records will be deleted through the data cleanup process. These records were removed from the initial dataset because the number of missing value records was small, and less than 5% of the water quality samples were obtained. In addition, missing data from BOD, COD, SS, and WQI attributes will affect the water quality classification model to be developed.
In data transformation, data is transformed or consolidated into a suitable form for the data mining process [28,29]. In order to transform data beyond the initial dataset, attribute construction and normalization are required. In this study, a new attribute was derived from the initial dataset's attributes in order to facilitate data mining.
The CLASS attribute is a newly formed attribute derived from the WQI attribute. This feature represents the water quality class in the Langat River Basin at a particular location and time. Table 3 thoroughly defines the CLASS attribute as a nominal-type output variable, four class labels, and the number of records associated with each class. The definition of representation refers to the class of water and the suitability of the type of water used as outlined by the National Water Quality Standard (SKAN).  Table 3 is based upon the number of CLASS attribute records representing water quality in the Langat River Basin from January 2012 to December 2016. The CLASS attribute as an output or target variable referred to as a label in a classification model for water quality will be created utilizing 548 water quality dataset records. Consequently, based on the obtained label values, namely CLASS I, CLASS II, CLASS III, and CLASS IV. Since there is more than one class attribute, this study is a multiclass classification problem.
The data normalization process involves scaling the six attributes of water quality parameters into a small range of values due to the measurement of water quality parameters in different units. The normalization method introduced by Hameed et al. [22] will be used in this study by setting the minimum value of the normalized data set to 0.1 and the maximum value normalized to 0.9. This data normalization method was chosen because the environment of the study area used is almost similar to the current study and tends to give better results with high-performance accuracy.
Data normalization was performed according to the following Formula (1): Based on Formula (1), Xnew is the normalized value of the original attribute, X is the original data point, and Xmin and Xmax are the minima and maximum values in the dataset.
Since the focus of the study was on water quality assessment based on the water quality parameters used in the calculation of the WQI-JAS formula, only the attributes related to WQI-JAS were selected, which are DO, BOD, COD, SS, pH, and AN. Four attributes have been eliminated during the attribute reduction step: STA NO, RIVER, SMP-DATE, and WQI. The initial filtered water quality dataset was reduced from ten to six attributes. In the next phase, a dataset of 548 records will be used as a training and test dataset in developing a multiclass classification model for water quality classification.

Phase III: Multiclass Classification Model Development
The algorithms that have been used in previous studies for water quality assessment have their strengths and limitations. Therefore, based on the previous studies that have been described, the ANN, DT, and SVM algorithms were selected to perform the classification task on the Langat River Basin water quality dataset used in this study. The ANN algorithm was chosen because of its ability to model non-linear and complex relationships between input and output variables or where relationships between input variables are difficult to understand [29], especially when involving water's physical, chemical, and biological parameters. The DT algorithm was chosen because it is a classification technique that is easy to understand and widely used. In addition, it is appropriate to train the dataset used, where memory usage is minimized, making it a time-saving and cost-effective approach for water quality classification [3]. Next, the SVM algorithm was chosen because it can model non-linear relationships between input variables. It uses non-linear mapping to transform the original training data to a higher dimension. It works by classifying data into different classes by finding the optimal separator hypersometric. The optimal separator hyperplane separates the training data into classes and maximizes the distance to the nearest point from one of the classes. As a result of maximizing the margin between the two classes on the training data, the classification performance is better on the test data, thus achieving maximum generalization [30]. SVM is also the most commonly used data mining algorithm for water quality assessment. It is a powerful alternative to artificial neural networks in predicting water quality in non-point-source polluted rivers [5].
Three algorithms (i.e., ANN, DT, and SVM) have been developed to find the best classifier for the Langat River Basin water quality assessment. Figure 3 shows the experimental design for the multiclass classifier used in this study. The model development involves: (1) dividing the data into training and testing sets using the cross-validation method; (2) selecting an algorithm for the classification model; (3) the model parameters optimization; and (4) training and evaluating the model. Based on Figure 3, three multiclass classifiers will be developed and evaluated to determine the best classifier. This study will employ the search grid method to identify the best parameters for the classification model to optimize the model's performance. Whereas the performance of each classifier is measured using k-fold cross-validation, where k = 10.

Neural Network Settings
Neural networks consist of a set of interconnected layers. The first layer is the input layer and is connected to the output layer by an acyclic graph consisting of weighted sides and nodes. Between the input and output layers, there is a hidden layer where the prediction task can be performed easily with only one or several hidden layers. In general, neural network classification requires a labeled dataset containing labeled columns. The labeled datasets act as inputs to train the model. Values are calculated at each node in the hidden layer and at the output layer to calculate the network output for a particular input. The value is set by calculating the weighted sum of the node values from the previous layer. The activation function is then applied to that weighted amount.
According to Sani et al. [11], the choice of network type relies on the problem being solved. Backpropagation algorithms are commonly used to train neural networks. This exercise is usually performed by updating the weights iteratively based on the error signal. The error is calculated in the output layer as the difference between the true class and the actual output value multiplied by the gradient of the sigmoid activation function. Then, the error signal is propagated back to the bottom layer. Therefore, backpropagation is a gradient descent algorithm that tries to minimize each iteration's error. The learning algorithm adjusts the network weights so that the error decreases along the descending direction.
The neural network algorithm used in this study is based on a multi-layer feed-forward neural network trained with stochastic gradient decrease using backpropagation. As shown in Table 4, the developed ANN model has three parameters configured to find the optimum value. The grid search method was used to obtain the best parameter settings for the ANN model. The learning rate of annealing reduces the learning rate to be trapped into the local minimum in the optimization space {0.01-1.0}

Decision Tree Settings
The basic idea behind the decision tree is to use a divide-and-conquer approach. The decision tree (DT) algorithm can address binary or multiple class classification problems and can be represented in the form of a tree structure. Each tree node can be a leaf node or a decision node. The leaf node denotes the value of the target attribute or class. For multiclass classification problems, leaf nodes can refer to one of the relevant N classes. The result node determines the number of tests performed on one attribute from an existing observation by generating one possible branch of that test. The process of classifying specific data through a result tree begins by evaluating the test contained in the root node or result node. It moves through it up to the leaf node, which determines the classification of the data. In this study, the DT model to be developed has several parameters that need to be configured. As shown in Table 5, four parameters in the DT model will be set to find the optimal value. The grid search method was used to obtain the best parameter settings for the DT model. According to Haixiang et al. [31], the quality of SVM parameter selection and kernel functions affects the performance of the SVM model. Therefore, the appropriate value of the kernel function and its parameters should be selected to obtain optimal classification performance. Once the appropriate kernel functions and their parameters have been obtained, then the prediction errors of the SVM model can be minimized. In this study, the developed SVM model has three parameters that will be configured to find the optimal value as shown in Table 6. The grid search method was used to obtain the best parameter settings for the SVM model

Phase IV: Multiclass Classification Model Evaluation
The prediction model's performance was evaluated by comparing the values of accuracy, precision, recall, and F1-Score. Those values were calculated based on the confusion matrix. Prediction results and actual class were put in a matrix for comparison depending on a positive and negative value. Confusion matrices have two types of errors: Type I and Type II. A Type I error is also known as a false positive, and a Type II error is known as a false negative. For a multiclass problem with a number of classes k, the size of the confusion matrix is N kxk . The confusion matrix can represent the classification results, as shown in Table 7. Table 7. Multiclass confusion matrix.

Predicted
True Based on Table 7, where k = {1, 2, …, n} while each element Ci, j represents the number of tuples predicted to be class i but belong to class j. At the same time, the sum of all the elements in the confusion matrix is equal to the sum of the N samples given to the classifier. From the confusion matrix of the various classes, the number of true positive predictions of TP for each class k is given by Equation (2) where k is a reference to an individual class.
Next, the number of false negative predictions of FN (Type II error) for each true class k can be obtained based on the following Equation (3): where n = total number of classes, i = predicted class row, and j = correct class column. The number of true negative predictions of TN for each class k can be calculated according to the following Equation (4) Next, the number of false positive predictions of FP (Type I error) for each class k is given by the following Equation (5) The experiment results of ANN, DT, and SVM are recorded and analyzed to see the performance of each multiclass classification model. These multiclass classification models were developed to classify four class labels, namely 'CLASS I', 'CLASS II', 'CLASS III', or 'CLASS IV', for water quality assessment of the Sungai Langat Basin.
Three classification algorithms are compared in this study, which are ANN, DT, and SVM. Each classifier is tuned using different tuning parameters to produce highly accurate results. This study utilized the search grid approach to determine the optimal parameters for the classification model to optimize the model's performance. A series of experiments were conducted to get the optimal values of each classifier. The performance between the three classifiers is then evaluated and compared. Table 8 shows the optimum parameter setting after the model parameter optimization experiment for the ANN, DT, and SVM. The investigation revealed that the optimal ANN parameters are the activation function = rectifier, learning rate = 0.2, and annealing rate = 0.5. Meanwhile, for the DT model settings, the experiment showed the gini_index is the criteria that determines the separation of attributes, with the minimum value to separate the node being 0:09 and the confidence value being 0.12500005. Additionally, Table 8 shows the optimum parameter of the SVM model, in which the kernel function for the model is the linear function, while the constant complexity C is set as 100.0. The parameter polynomial by binomial classification is used to expand from binary to multiclass classification, and the technique 1 against 1 has been selected.  Table 9 shows the results of TP, FP (Type I error), TN, and FN (Type II error) values for each predicted class. Based on Table 9, the ANN model correctly classified the data with the true positive (TP) values represented by TPI, TPII, TPIII, and TPIV, where TPI =  14, TPII = 163, TPIII = 301, and TPIV = 22, according to Equation (2). Meanwhile, the DT model correctly classified the data at the TP rate as TPI = 16, TPII = 146, TPIII = 307, and TPIV = 21. Furthermore, the SVM model correctly classified the TP rate with TPI = 22, TPII = 158, TPIII = 305, and TPIV = 23. Based on the confusion matrix results shown in Table 9, the rate of TP for models ANN, DT, and SVM is compared and analyzed in the graph, as shown in     (5), respectively, by adding all the classification errors in the respective class column. Table 10 shows that for the ANN, the value of FN (Type II error) and FP (Type I error) is 48. While for the DT, the value of FN (Type II error) and FP (Type I error) is 58. Whereas for the SVM, the value of FN (Type II error) and FP (Type I error) is 40. In contrast, each class's true negative (TN) can be calculated by following Equation (4). As for the comparison of the three models, the model to reach the highest TP is the SVM, with a total number of TP of 508 correctly classified as compared to the ANN of 500 and followed by model DT of 490. The results show that the SVM classifier performs better in classifying data into different classes. In order to select the best model, the confusion matrices were used to measure accuracy, precision, recall, and F1-Score by using macro average and micro average approaches. ROC curves are also a good method for evaluating models, but the classification model developed in this study is based on an imbalanced dataset; consequently, the ROC curve is not a good visual representation of imbalanced data because decision thresholds are not explicitly depicted in the ROC curve, and the distinction between the models is difficult to define [32]. Consequently, we present the total performance of classification models using macro-averaged scores to avoid bias for major categories in the imbalanced data associated with micro-averaged scores. This is because we are particularly concerned with the performance of minor categories. Macro averaging assumes all classes are equal and important, while micro averaging favors a larger class. Because of the imbalanced data, macro averaging is used to evaluate the performance of the classifier model for each class, while micro averaging is used to evaluate the performance of the classifier model as a whole. The macro averaging and micro averaging of accuracy, precision, recall, and F1-Score can be calculated according to the following Equations  Table 11 shows the performance of ANN, DT, and SVM based on the accuracy (Acc), precision (Pr), recall (Rc), and FI-Score (F1) using macro and micro averaging. The model that provides predictions with higher accuracy, precision, and recall can perform the classification task.  Figure 5 shows a comparison graph of model performance based on the percentage of prediction accuracy, precision, recall, and F1-Score. The micro averaging accuracy shows that SVM produces the highest micro accuracy of 92.7% compared to the ANN at 91.24% and the DT at 89.42%. The same value for micro accuracy, micro precision, micro recall, and micro F1-Score of each model, where Acc = Pr = Rc = F1, is shown in Figure 5. Results of Pr micro and Rc micro are the same with precision micro when each data point is only given to one class. This can be explained where the count value Pr micro and Rc micro by Equations (7) and (8), and the rate of TP, FP (Type I error), and FN (Type II error) for all label classes must be known in advance (refer to Table 10). Therefore, when a data point from one class label is predicted, if the result is a false positive FP (Type I error), then there will be a false negative FN (Type II error) and vice versa. For example, if the label 'CLASS I' is predicted and the true class label is 'CLASS II', then the prediction result or 'CLASS I' is false positive FP (Type I error) and false negative FN (Type II error) or 'CLASS II'. If the prediction is correct, the 'CLASS I' is predicted, and the class label is 'CLASS I', the results of the prediction are true positive TP.
Referring to Figure 5, SVM again shows the highest macro accuracy of 96.35% compared to the ANN and DT models. Additionally, SVM shows the highest macro precision of 91.97%, macro recall of 84.89%, and micro F1-Score at 87.98% (refer to Figure 5) compared to the ANN and DT models. The macro accuracy of ANN was at 95.62%, supported by the macro precision at 92.06%, macro recall at 77.39%, and macro F1-Score at 82.43%.
Next, the macro accuracy of DT results at 94.71% is supported by the macro precision at 89.22 %, macro recall at 76.35%, and macro F1-Score at 81.42%.

Statistical Significance Tests
Comparing the machine learning algorithms and selecting the best model is a common process in the machine learning task. In this study, the performance of algorithms is compared using statistical tests, namely the paired corrected t-test on 548 instances of water quality data. The ANN, DT, and SVM classifiers were evaluated against the water quality data sets with a significant level of 0.05 (95%). In the t-test, ANN is used as a baseline model. For the comparison with DT and SVM, 10-fold cross-validation is used.
From Table 12, the algorithm ANN, as the baseline for comparison, is marked as (1) and has an accuracy of 95.62%. These results are compared with the DT algorithm marked as (2) and the SVM algorithm marked as (3). The symbol '*' next to the DT result indicates that the result is different from the ANN result, but the score is lower. In contrast, the symbol 'v' next to SVM showed that the SVM results are larger than ANN, and the difference is significant, with an accuracy of 96.35%. This shows that the SVM is the best multiclass classifier and is statistically significant at a 0.05 confidence level.

Discussions
This study examines the use of machine learning algorithms for multiclass classification models in assessing water quality in the Langat River Basin. A set of water quality data for a period of five years (2012-2016) were used in this study. Three series of experiments were conducted using a neural network algorithm, a decision tree algorithm, and a support vector machine algorithm with six water quality parameters as model input, as well as four WQI class labels as the target output.
In this study, the outcomes revealed that three machine learning models perform well in predicting the WQI; however, SVM is the most accurate model developed using a small dimensional space of 560 rows and 10 columns of dataset. According to previous studies [5,33,34], the SVM is an efficient method and has outperformed artificial neural networks in many studies related to the classification of water quality data. In prior studies, the SVM algorithm effectively predicts water quality using small datasets, whereas the ANN approach is superior for large, high-dimensional datasets. Moreover, it is necessary to specify a kernel function of good quality to achieve an SVM with high classification accuracy. The experimental results show that for the stated SVM parameter setting, the linear kernel is found to be the best choice for the classification process. In this study, the model used the kernel function to map input data into the high-dimensional feature space and find the optimum hyperplanes to separate the two data classes. The experimental results also show that the SVM model can achieve the highest performance using the 1 against 1 multiclass approach.
Furthermore, the ANN model achieved better accuracy in classifying given input data based on the water quality parameters of the Sungai Langat Basin compared to DT. The experimental results show that the ANN model can improve the accuracy of water quality class classification and model the complex non-linear relationship between the input and output data compared to other conventional techniques. Several studies have also supported this; ANN shows high-performance accuracy compared with other conventional methods in modeling and predicting the water quality index in a tropical environment [11,[19][20][21][22][33][34][35]. In addition, the DT model is good for identifying the WQI class label because it is simple to comprehend and implement. It is capable of facilitating, analyzing, and classifying water quality data. However, further investigations are required to improve the accuracy of the DT model.
The classification model developed in this study involves an unbalanced dataset because the number of data points in each class is different. In other words, one class label has a very large number of observations whereas the other has a very small number of observations. According to Haixiang et al. [31], an imbalanced dataset is referred to as the dataset in which one or more classes have a larger number of samples than the other. The class with the highest number of samples is called the majority class, while the lowest number of samples is called the minority class. Nevertheless, in this study, each class is important because it refers to the type of water class and the suitability of the type of water use as outlined by the National Water Quality Standards. Therefore, to address the imbalanced dataset problem, a confusion matrix can be used to show a more detailed breakdown of the true and false classifications for each class. The confusion matrix used includes a column matrix representing true class labels, while the row matrix represents predicted classes.
On the other hand, more detailed fractional information for correct classification and incorrect classification for each class would be lost if model performance evaluations were only measured using the overall accuracy of all classes. The overall accuracy of all classes will give a less accurate picture because larger classes will dominate the results. Therefore, the model performance evaluation for each class was measured using the average of each class's accuracy because there were different numbers of data for each class in the dataset used in this study. The average of each accuracy class is also known as accuracy using macro averaging. The evaluation based on macro averaging is important when the study dataset has unbalanced classes. Since we are highly concerned with the performance of every category, particularly the minor ones, we present the overall performance using macro-averaged scores to prevent bias for major categories in the imbalanced data associated with micro-averaged scores [36]. Other metrics, such as accuracy, recall, and F1-score, are frequently used in the research community to evaluate models trained on imbalanced data [37]. Therefore, the confusion matrix, together with accuracy, precision, recall, and F1-Score (based on macro and micro averaging), is used as the performance measure. In general, for a given class label, different combinations of precision and recall represent the following meanings; (1) high precision and high recall indicate that the class label is handled perfectly by the classifier model; (2) high precision and low recall indicate the class label cannot identify the class label well but is very reliable when it occurs; (3) low precision and high recall indicate the class label is well identified, but the classifier model also identifies other classes in it; and (4) low precision and low recall indicate the class label is not well handled by the classifier model.
Thus, the experimental results and model evaluation based on the multiclass confusion matrix, including accuracy, precision, recall, and F1-Score using macro averaging, show the SVM model is the best multiclass classifier compared to ANN and DT. The t-test results have also shown that the SVM model is the best multiclass classifier, and the test results are statistically significant at the 0.05 confidence level. The findings strengthen the argument that machine learning models, particularly SVM, may be used to forecast WQI with a high accuracy level, hence enhancing water quality management. Overall, the experimental and evaluation results of the models presented could change the way WQI classes are classified, and water quality monitored in the future, thus enabling better water resources management by reducing costs and time involved in monitoring and evaluation processes.
Although this study was successful in accomplishing its objectives, there is still an opportunity for continuous improvement to be carried out. This is because, when developing classification and prediction models, the primary focus is on how to generate better models and results. The use of larger datasets to predict water quality classes is one of the recommended enhancements. Using larger datasets involves training on historical data and developing classifier models to predict new data. Furthermore, using larger datasets allows for the discovery of more hidden patterns. The models generated can be improved by conducting more detailed investigations into the associations between water quality measures. At the same time, the experimental design for model configuration can be modified automatically to give additional functions such as loss function optimization, utilizing search methods such as random search that explore over hyperparametric space to find a desirable configuration value.
Next, the use of feature selection algorithms can be introduced in future studies to test the prediction and accuracy of classification models based on various scenarios consisting of different water quality parameters. In addition, supervised machine learning algorithms can be utilized for time series prediction problems on the raw water quality dataset of a time series dataset. Several supervised machine learning algorithms have recently been developed for the R and Python programming environments. This opportunity can be taken to explore algorithms developed to solve time series prediction problems.

Conclusions
This paper presents the experimental results of the model development phase that has been implemented. The experimental results were analyzed comparatively based on the multiclass confusion matrix, followed by accuracy, precision, and recall using macro and micro averaging to evaluate the performance of the ANN, DT, and SVM models. Overall, the experimental results show that the ANN, DT, and SVM performance is good while having advantages and disadvantages, respectively.
Next, the performance of each classification model was measured and compared using evaluation metrics that include confusion matrix, accuracy, precision, and retrieval to determine the best multiclass classification model. Comparative analysis based on accuracy, precision, and retrieval using macro and micro averaging showed that all three models had achieved more than 85% performance. The best model that achieved the highest classification performance was SVM, with an accuracy rate of 96.35%, supported by a precision value of 91.97% and a recovery of 84.89% based on macro averaging. The SVM model is also a multiclass model that is good at classifying data from different classes based on a confusion matrix. As a result, the SVM model was selected as the best classifier model in this study, and the t-test showed that the test results were statistically significant at the 0.05 confidence level.
This study succeeded in identifying the best techniques and algorithms among the three models developed to classify the suitability of water use types according to the standards as outlined by SKAN. Earlier, monitoring data recorded over five years from January 2012 to December 2016 from seven rivers managed by DOE were processed in this study using data preprocessing techniques. The classification model developed has successfully identified water quality from various classes based on clean and quality Langat River Basin data.
Moreover, the multiclass classification model developed using the decision tree algorithm in this study has achieved higher accuracy with six optional parameters determined by the expert panel as compared to the decision tree model developed in the previous study by Ho et al. [3]. As a result, as compared to utilizing the traditional IKA-JAS formula, the research strategy based on data mining and machine learning approaches reduced the cost and complexity of calculating sub-indices of six water quality parameters and classifying water quality.
One of the most important tasks in developing a machine learning model is evaluating its performance to assure the classifier model's success and the study's effectiveness. This study has mediated the use of metric evaluation based on micro and macro averaging to address the classification problem of various classes and unbalanced datasets. The effectiveness of the study has indirectly contributed to the improvement of water quality management by providing data mining approaches and machine learning techniques that offer a variety of classification and forecasting methods to meet the specific needs of policymakers, environmental experts, and the general public.
This study achieved its goals, yet there is room for improvement. When developing classification and prediction models, the focus is on improving results. For predicting water quality classes, larger datasets are recommended. Training on greater historical data and developing classifier models to predict new data are required when using larger datasets. Larger datasets reveal more hidden patterns. More extensive research of water quality measures can improve the models. The experimental design for model configuration can be modified automatically to give additional functions such as loss function optimization using search methods such as random search to explore hyperparametric space. Future studies can use feature selection algorithms to test the accuracy of classification models based on diverse water quality parameter scenarios.
Furthermore, supervised machine learning algorithms can be used to solve time series prediction issues on a raw water quality dataset. Several supervised machine learning techniques for the R and Python programming environments have recently been created. This is an excellent opportunity to investigate techniques designed to handle time series prediction challenges.