Modeling Surface Water Quality Using the Adaptive Neuro-Fuzzy Inference System Aided by Input Optimization

Abstract: Modeling surface water quality using soft computing techniques is essential for the effective management of scarce water resources and for environmental protection. The development of accurate predictive models with significant input parameters and inconsistent datasets is still a challenge. Therefore, further research is needed to improve the performance of the predictive models. This study presents a methodology for dataset pre-processing and input optimization to reduce the modeling complexity. This objective was achieved by employing a two-sided detection approach for outlier removal and an exhaustive search method for selecting essential modeling inputs. Thereafter, the adaptive neuro-fuzzy inference system (ANFIS) was applied for modeling electrical conductivity (EC) and total dissolved solids (TDS) in the upper Indus River. A large dataset covering a 30-year historical period, measured monthly, was utilized in the modeling process. The prediction capacity of the developed models was estimated using statistical assessment indicators. Moreover, the 10-fold cross-validation method was carried out to address model overfitting. The results of the input optimization indicate that Ca²⁺, Na⁺, and Cl⁻ are the most relevant inputs for EC, while Mg²⁺, HCO₃⁻, and SO₄²⁻ were selected to model TDS levels. The optimum ANFIS models for the EC and TDS data showed R values of 0.91 and 0.92, and root mean squared error (RMSE) results of 30.6 µS/cm and 16.7 ppm, respectively. The optimum ANFIS structure comprises a hybrid training algorithm with 27 fuzzy rules, using triangular membership functions for EC modeling and Gaussian curve membership functions for TDS modeling. The outcome of the present study reveals that ANFIS modeling, aided by data pre-processing and input optimization, is a suitable technique for simulating the quality of surface water. It could be an effective approach for minimizing modeling complexity and devising proper management and mitigation measures.


Introduction
Surface water bodies are naturally available resources and have always been considered essential for the persistence of the ecosystem. The quality of these water resources is adversely affected due to anthropogenic activities, including industrialization and population growth. The term "water quality" refers to the physical, chemical, and biochemical TDS concentrations. The authors reported excellent performance and accurate predictions by the ANFIS and ANN. A study by Shah et al. (2020) [17] compared the performances of ANN, GEP, and regression techniques, in terms of predicting the TDS and EC concentrations. A sensitivity and parametric study was carried out for evaluating the connection between the inputs and the output. The authors reported improved performance of the GEP as compared with the ANN and regression techniques. Sensitive parameters were identified, which had a direct impact on the modeling output.
AI and soft computing techniques have been successfully applied in the abovementioned studies for water quality prediction, but with some shortcomings. Algorithms such as ANN, SVM, and GEP have some unknown parameters [12,30], and these parameters have a significant effect on the accuracy of the model output. A common and recurring issue with these algorithms is that they may become trapped in a local optimum [30]. Moreover, water quality data are highly chaotic, stochastic, and nonlinear, so a standalone AI-based model has limitations in water quality modeling. Therefore, integrating data pre-processing and optimization approaches with AI models is likely to enhance their accuracy and predictive capability.
In this study, data pre-processing, followed by an exhaustive search method for input optimization and ANFIS modeling, is proposed to reduce the complexity of modeling surface water quality. A two-sided outlier detection approach was used for data pre-processing, with the outlier thresholds set to ±3σ (sigma rule). Various water quality parameters, recorded monthly over a period of 30 years (1975-2005), were used in the modeling process. An optimization routine was developed to select the most correlated and significant input variables. Afterward, an ANFIS model structure was developed that was efficient in predicting the surface water quality, indicated by the EC and TDS concentrations, in the upper Indus Basin (UIB). The ANFIS structure that yielded the lowest modeling error with the minimum number of rules was selected in order to reduce the modeling complexity. Furthermore, cross-validation was employed to evaluate the final outcome of the ANFIS model. The methodology adopted in this study will help in minimizing the data sampling and processing efforts in surface water quality assessments.

Case Study and Modeling Dataset
The Indus River is 2880 km long and is considered a major river in Asia, with a drainage area of almost 912,000 km² [31]. The portion of the Indus River upstream of the Tarbela reservoir is the upper Indus basin (UIB). It has a total length of 1150 km and drains a large area of 165,400 km² [32,33]. The elevation varies from 455 m to 8611 m, and the climate differs significantly within the basin. The annual precipitation ranges from 100 to 200 mm and occurs due to turbulence in the western mid-latitudes [34–37]. The study area is shown in Figure 1.
The data employed in this study for ANFIS model development were collected from the Water and Power Development Authority (WAPDA), Pakistan. The dataset contained 321 monthly data points collected over a period of 30 years (1975 to 2005), measured at the Bisham Qilla outlet. The acquired data contain nine variables: calcium (Ca²⁺), magnesium (Mg²⁺), sodium (Na⁺), chloride (Cl⁻), sulphate (SO₄²⁻), bicarbonate (HCO₃⁻), pH, EC, and TDS. The descriptive statistics, that is, the mean, skewness, standard deviation, and kurtosis of the data, are given in Table 1. The normal probability curves in Figure 2 show the distribution of the target EC and TDS concentrations. A curve that is symmetrical about the mean of the dataset indicates a normal distribution [19,38]. Moreover, the literature demonstrates that data pre-processing is an essential step in any data mining process, aiming to eliminate the effect of missing or outlier measurements, which may occur during data collection [19,39]. In this study, a two-sided outlier detection approach was used for data pre-processing and outlier elimination, with the outlier thresholds set to ±3σ (sigma rule).
Figure 2. Normal probability curves of the electrical conductivity (EC) and total dissolved solids (TDS) data.

Input Optimization and Model Development
The Adaptive Neuro-Fuzzy Inference System (ANFIS) model is a kind of ANN that is based on the Takagi-Sugeno (TS) fuzzy approach, as shown in Figure 3. ANFIS implements fuzzy logic (FL) in the framework of an ANN [40]. The development process of ANFIS modeling involves identifying the most relevant inputs that correlate with a targeted output. The rules, and the types and number of the associated membership functions (MFs), need to be evaluated in order to select the optimum ANFIS model structure with the lowest errors. As an example, two TS fuzzy "if-then" rules in a typical ANFIS structure are the following:
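A standard first-order TS rule pair can be written as sketched below. The two-input setting with inputs x and y, the rule outputs f₁ and f₂, and the constant terms r₁ and r₂ are assumptions taken from the usual textbook form of a first-order TS system.

```latex
\begin{aligned}
\text{Rule 1: } & \text{if } x \text{ is } A_1 \text{ and } y \text{ is } B_1, \text{ then } f_1 = p_1 x + q_1 y + r_1 \\
\text{Rule 2: } & \text{if } x \text{ is } A_2 \text{ and } y \text{ is } B_2, \text{ then } f_2 = p_2 x + q_2 y + r_2
\end{aligned}
```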


where p₁, q₁, p₂, and q₂ are ANFIS parameters, while Aᵢ and Bⱼ are the linguistic labels or grades. According to Ying et al. (1995) [41], the ANFIS architecture consists of five layers. See Figure 3 for the ANFIS architecture, adapted from [42]. The roles of these layers are briefly described as follows.

• Layer 1, or fuzzification layer, receives the input values and identifies the MFs.
• Layer 2, or rule layer, generates the firing strengths for the rules.

In the present study, the ANFIS editor toolbox and coding in the MATLAB 2019b environment were used to train and develop the proposed ANFIS models. As mentioned earlier, seven input variables and two outputs were involved in the modeling process. All selected input parameters were related to the targeted surface water quality (EC and TDS). Moreover, the model training phase was conducted using the odd-numbered records (data points), while the even-numbered records were used in the model testing phase. The ANFIS learning process was repeated for many epochs, with the aim of reducing the errors between the actual values and the ANFIS modeling output. A flowchart of the data pre-processing, input optimization, and ANFIS structure development is presented in Figure 4.
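To make the role of the layers concrete, the following minimal Python sketch evaluates a two-rule, two-input first-order TS system in a single forward pass. It is an illustration only, not the MATLAB implementation used in this study. The Gaussian MFs, the product operator for the firing strengths, the normalization, consequent, and summation steps (the usual layers 3 to 5), and all names such as `ts_forward` are assumptions based on the standard ANFIS formulation.

```python
import numpy as np

def gaussian_mf(x, centre, sigma):
    """Gaussian membership grade of x (hypothetical helper)."""
    return np.exp(-0.5 * ((x - centre) / sigma) ** 2)

def ts_forward(x, y, mf_x, mf_y, consequents):
    """Forward pass of a first-order TS system with one rule per MF pair.

    mf_x, mf_y  : lists of (centre, sigma) for the labels A_i and B_i
    consequents : list of (p, q, r) so that f_i = p*x + q*y + r
    """
    # Layer 1: fuzzification, membership grades of each input
    mu_a = [gaussian_mf(x, c, s) for c, s in mf_x]
    mu_b = [gaussian_mf(y, c, s) for c, s in mf_y]
    # Layer 2: firing strength of each rule (product of grades)
    w = np.array([a * b for a, b in zip(mu_a, mu_b)])
    # Layer 3 (assumed): normalised firing strengths
    w_bar = w / w.sum()
    # Layer 4 (assumed): linear consequent of each rule
    f = np.array([p * x + q * y + r for p, q, r in consequents])
    # Layer 5 (assumed): weighted sum gives the crisp output
    return float(np.dot(w_bar, f))
```

When both rules fire equally, the output is simply the average of the two rule consequents, which is a convenient sanity check for an implementation of this kind.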


Model Assessment Criteria
The results obtained from the model were evaluated using several statistical checks. R-squared (R²) was used to evaluate the relationship between the observed values and the predicted values, and is calculated using Equation (1). In Equation (1), n is the number of observations, t and y are the observed and predicted values, respectively, and t̄ and ȳ are the average observed and predicted values. R² ranges from 0 to 1, with 1 indicating the most accurate relationship possible. Values of R² greater than 0.7 are considered highly reliable in engineering models.
Other established statistical checks, namely the mean absolute error (MAE) and root mean squared error (RMSE), were also applied to evaluate the accuracy of the developed model. One of the main advantages of RMSE is that, because the errors are squared, it assigns a higher weight to larger errors. Equations (2) and (3) give the mathematical expressions for MAE and RMSE, respectively.
In addition to R, MAE, and RMSE, the percentage relative error (RE%) and the Nash-Sutcliffe coefficient (NSC) were also used to assess the accuracy of the developed model. The NSC is recommended by various researchers and by the American Society of Civil Engineers (ASCE). The NSC ranges from −∞ to 1, with values greater than 0.8 indicating an accurate model. Equations (4) and (5) give the mathematical expressions for RE% and NSC, respectively.
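The assessment criteria can be computed along the lines of the short Python sketch below. The exact forms are assumptions based on the common definitions: R² is taken as the squared Pearson correlation between t and y, RE% uses the sign convention "observed minus predicted" (so that positive values indicate underestimation), and the function name `assessment_metrics` is hypothetical.

```python
import numpy as np

def assessment_metrics(t, y):
    """Return R, R2, MAE, RMSE, NSC and per-sample RE% for observed t and predicted y."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    err = t - y                                   # observed minus predicted
    r = np.corrcoef(t, y)[0, 1]                   # Pearson correlation coefficient
    return {
        "R": float(r),
        "R2": float(r ** 2),                      # assumed: squared correlation
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # Nash-Sutcliffe: 1 minus the error variance relative to the observed variance
        "NSC": float(1.0 - np.sum(err ** 2) / np.sum((t - t.mean()) ** 2)),
        "RE%": 100.0 * err / t,                   # one value per data point
    }
```

For example, `assessment_metrics(ec_observed, ec_predicted)["RMSE"]` would give the RMSE in the units of the observations, where `ec_observed` and `ec_predicted` are placeholder arrays.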

10-Fold Cross-Validation
Assessing the performance of a machine learning model can be difficult, because a model may not provide the required output for data that were not used in training. A dataset is usually divided into training and testing subsets, and the outcome is then evaluated based on statistical criteria, but this method is not applicable in all scenarios [43]. Therefore, to verify the generalization capability and reduce the overfitting problem of a learning model, the 10-fold cross-validation method has been recommended in the literature [17,44,45]. This method divides the whole dataset into 10 subsets. In each round, 9 subsets are used for model training and the remaining subset is used for validation. The process is repeated so that every subset serves once as the validation set, and the output is expressed as the mean accuracy obtained over the 10 rounds. In our study, this cross-validation method was applied to validate the ANFIS model.
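The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the records are shuffled before splitting (the text does not say whether they were), RMSE is used as the per-fold score, and a least-squares linear model (`fit_linear`, `predict_linear`) stands in for ANFIS training purely to keep the example self-contained. All function names are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def fit_linear(X, t):
    """Least-squares stand-in for model training (not ANFIS)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def predict_linear(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w

def cross_validate(X, t, fit, predict, k=10):
    """Return the mean validation RMSE over k folds and the per-fold values."""
    scores = []
    for train, val in k_fold_indices(len(t), k):
        model = fit(X[train], t[train])
        err = t[val] - predict(model, X[val])
        scores.append(float(np.sqrt(np.mean(err ** 2))))
    return float(np.mean(scores)), scores
```

Reporting the per-fold values alongside the mean makes it easy to spot an individual fold that deviates from the rest.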

Data Pre-Processing
The collected data were statistically analyzed in order to check the consistency and reliability. The final dataset included the information of nine variables, that is, Ca²⁺, Mg²⁺, Na⁺, Cl⁻, SO₄²⁻, HCO₃⁻, pH, EC, and TDS. The data-refining process was performed using MATLAB 2019b. Figure 5 shows the data pre-processing and outlier elimination from the targeted EC and TDS data. A two-sided outlier detection approach was adopted and the threshold outlier values were set to ±3σ (sigma rule). Any values outside this threshold were identified as outliers and removed, as illustrated in Figure 5, for EC and TDS. Similarly, the data cleaning and outlier removal procedure was applied to the other parameters as well.
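A minimal Python sketch of the two-sided ±3σ rule is given below. It assumes that the threshold is measured from the sample mean of each variable, that the sample standard deviation is used, and that a record is dropped when any of its variables falls outside the threshold. The function names are hypothetical, and the study itself performed this step in MATLAB.

```python
import numpy as np

def three_sigma_mask(values, n_sigma=3.0):
    """True where a value lies within mean ± n_sigma * std (two-sided rule)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)          # sample standard deviation (assumed)
    return np.abs(values - mean) <= n_sigma * std

def remove_outlier_rows(data, n_sigma=3.0):
    """Keep only the rows in which every column passes the sigma rule."""
    data = np.asarray(data, dtype=float)
    keep = np.all([three_sigma_mask(col, n_sigma) for col in data.T], axis=0)
    return data[keep]
```

The thresholds here are computed once from the full series; whether the rule was re-applied after removal is not stated in the text, so a single pass is assumed.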

Input Optimization
The selection of the best input combination serves as a basis for the accurate prediction of the desired output. If the number of variables is high, the number of combinations, and hence the computational time, will also be high [46]. For the input optimization process, the dataset was first separated into two subsets, that is, the odd-numbered records were selected for the model training phase, while the even-numbered records were selected for the testing phase. Secondly, the stepwise ANFIS exhaustive search function for input optimization was applied to identify the most relevant inputs for modeling the EC and TDS levels. Input optimization methods have been successfully employed in many research studies for robust model development [47–49].

Figure 6a,b shows the exhaustive search results, expressed as RMSE values for the training and testing datasets. For EC, the first combination of three input variables (Ca²⁺, Na⁺, and Cl⁻) is the most correlated with, and relevant to, the targeted output. This combination was selected because it showed the lowest RMSE values in the training data. Some of the other combinations have lower errors in the testing data; however, they showed high training errors. Therefore, based on the results of the input optimization using the ANFIS exhaustive search method, the three most relevant inputs were Ca²⁺, Na⁺, and Cl⁻ for modeling the EC concentrations, while Mg²⁺, HCO₃⁻, and SO₄²⁻ were selected to model the TDS levels. These selected input parameters are the most correlated with the variation in the target output concentrations in each case. Various research studies reported that Ca²⁺ and Cl⁻ are important parameters for EC, while Mg²⁺, total hardness, and SO₄²⁻, along with other parameters, are effective inputs for modeling TDS [17,20,50].
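The exhaustive search described above can be sketched in Python as follows. With seven candidate inputs and three inputs per model there are 35 combinations to evaluate. The sketch is an illustration only: a least-squares linear model stands in for the short ANFIS training run that the actual search performs, the odd/even split follows the text, and ranking by training RMSE mirrors the selection criterion stated above. All function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def fit_linear(X, t):
    """Least-squares stand-in for a short ANFIS training run (assumption)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def predict_linear(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w

def rmse(t, y):
    return float(np.sqrt(np.mean((t - y) ** 2)))

def exhaustive_search(X, t, names, n_inputs=3):
    """Rank every n_inputs-sized input combination by training RMSE."""
    # Records 1, 3, 5, ... (odd-numbered) train; records 2, 4, 6, ... test
    X_tr, t_tr = X[0::2], t[0::2]
    X_te, t_te = X[1::2], t[1::2]
    results = []
    for combo in combinations(range(X.shape[1]), n_inputs):
        cols = list(combo)
        w = fit_linear(X_tr[:, cols], t_tr)
        results.append((
            rmse(t_tr, predict_linear(w, X_tr[:, cols])),   # training error
            rmse(t_te, predict_linear(w, X_te[:, cols])),   # testing error
            tuple(names[c] for c in cols),
        ))
    return sorted(results)   # lowest training RMSE first
```

Keeping the testing RMSE in each result tuple allows combinations with a low training error but a poor testing error to be inspected before a final choice is made.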

ANFIS Model Development
After the best input combination had been defined, the best ANFIS structure was developed by applying various types and numbers of membership functions (MFs), and different rules and epoch numbers. This was performed to test the possible settings of the ANFIS parameters and compare their abilities in modeling the surface water quality (EC and TDS). In the ANFIS modeling, 70% of the data points were used for training, while the remaining data were used for testing. Table 2 presents the performances of various ANFIS models using 2-5 MFs. The optimum MF number was 3 for both EC and TDS modeling, which gave the lowest modeling errors and the highest R values. Each optimum MF was assigned to handle each input parameter.
Eight MF types were tested to select the optimum MF type, as shown in Table 3. These types were: triangular MF (Trimf), trapezoidal MF (Trapmf), generalized bell curve MF (Gbellmf), Gaussian curve MF (Gaussmf), two-sided Gaussian MF (Gauss2mf), pi-shaped curve MF (Pimf), the difference between two sigmoidal MFs (Dsigmf), and the product of two sigmoidal MFs (Psigmf). Table 3 compares the resulting errors (RMSE and MAE) and R values obtained with these MF types for the training, testing, and overall datasets. In modeling the EC concentrations, the triangular MF gave the lowest errors in all the datasets and outperformed the other MF types. Meanwhile, for TDS concentrations, the Gaussian curve MF showed the lowest modeling errors for the training, testing, and overall datasets.
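For reference, the two MF types that performed best can be written in a few lines of Python. The parameterization below (feet a and c and peak b for the triangle, centre and sigma for the Gaussian) is the common textbook form and is an assumption here, as are the function names. The rule count assumes grid partitioning, in which every combination of one MF per input forms a rule, so that three MFs on each of three inputs give 27 rules.

```python
import numpy as np

def triangular_mf(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b (requires a < b < c)."""
    x = np.asarray(x, dtype=float)
    rising = (x - a) / (b - a)
    falling = (c - x) / (c - b)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

def gaussian_mf(x, centre, sigma):
    """Gaussian curve membership with the given centre and spread."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))

def grid_rule_count(n_mfs, n_inputs):
    """Number of rules under grid partitioning (assumed): n_mfs ** n_inputs."""
    return n_mfs ** n_inputs
```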

The selection of the optimum epoch number is a significant factor in ANFIS modeling. Increasing the epoch number does not always enhance the performance of the model. Usually, the modeling errors decrease with increasing epoch number up to a point and increase afterward, and identifying this point is necessary in ANFIS modeling. The results for the previously selected ANFIS parameters and varying epoch numbers are displayed in Figure 7. From the plots, the RMSE values vary for both the training and testing datasets as the epoch number increases (up to 50 epochs). The optimum epoch number for EC modeling was 3, while for TDS it was 3 or 10 epochs. These epoch numbers give the lowest modeling errors and also avoid overfitting.
The full descriptions of the final ANFIS models for the EC and TDS concentrations are listed in Table 4. For EC modeling, the optimum ANFIS structure consists of three triangular MFs per input and 27 rules, trained for 3 epochs. For TDS modeling, the optimum ANFIS structure consists of three Gaussian curve MFs (Gaussmf) per input and 27 rules, trained for 3 or 10 epochs to prevent overfitting. The RMSE results were 30.61 µS/cm and 16.72 ppm for the EC and TDS modeling, respectively. The fitting results for R and R² were 0.909 and 0.827 for EC modeling and 0.915 and 0.838 for TDS modeling, respectively. The NSC results were very close to R², which demonstrates the accurate performance of the resulting models in simulating the desired parameters. Figure 8 shows the assigned rules in the optimum ANFIS model structure for modeling (a) EC and (b) TDS levels.
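The epoch-selection step can be illustrated with a short Python sketch. It assumes that the testing RMSE has been recorded after every epoch and that the optimum is taken as the epoch with the lowest testing RMSE. This criterion and the name `best_epoch` are assumptions, since the text only states that the chosen epochs gave the lowest errors while avoiding overfitting.

```python
import numpy as np

def best_epoch(test_rmse):
    """Return the 1-based epoch with the lowest testing RMSE (earliest on ties)."""
    test_rmse = np.asarray(test_rmse, dtype=float)
    return int(np.argmin(test_rmse)) + 1
```

Because `np.argmin` returns the first minimum, ties are resolved in favour of the smaller epoch number, which also keeps the training cost low.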

Model Statistical and Error Assessment
As discussed earlier, the efficiency of the developed ANFIS models was assessed using statistical analysis and error assessment tests. Figure 9 compares the observed and simulated data. Moreover, Figure 10 shows the regression analysis between the observed and model-predicted EC and TDS concentrations for the selected ANFIS structure. The modeling output shows an excellent correlation between the two datasets. An R value above 0.9 was observed for both EC and TDS concentrations. Similarly, RMSE values of 30.61 µS/cm and 16.72 ppm were achieved by the EC and TDS models, respectively. The modeling results show that the developed models are very efficient in modeling the surface water quality parameters, given the set of initial input parameters.
Besides the statistical evaluation, the percent relative error (RE%) test was also conducted to check and demonstrate the accuracy of the proposed models. RE% plots are shown in Figures 11 and 12 for the optimum ANFIS models developed for EC and TDS, respectively. The results for both the EC and TDS models show that the relative error of most of the data lies between +20% and −20%, which describes the capacity of the developed ANFIS models for predicting the target output. Moreover, in both modeling outputs, the maximum RE% results are under 60%, which indicates that, to a limited extent, the ANFIS model underestimated the observed EC and TDS concentrations. The minimum RE% values are above −100% and −60% for the EC and TDS concentrations, respectively. The negative RE% results mean that the models overestimated the targeted salinity levels to a certain extent. Overall, the results showed excellent accuracy of the models in predicting EC and TDS.

Variation of Model Output with Selected Inputs
As discussed previously (Section 3.2), input optimization was applied to select the optimum combination of input parameters. The optimization process revealed that Combination 1 (Ca²⁺, Na⁺, and Cl⁻) is the optimum combination for modeling EC concentrations, while Combination 2 (Mg²⁺, HCO₃⁻, and SO₄²⁻) is the most correlated with the TDS concentrations. Figures 13 and 14, respectively, show the 3D surface plots of the selected input combinations and the target outputs. An increasing and fluctuating trend can be observed in Figure 13 for the EC concentrations with the variation in the input variables, that is, Ca²⁺, Na⁺, and Cl⁻. A similar result can be seen in the TDS modeling (Figure 14), where TDS shows a linearly increasing trend with all the input variables. The increasing tendency of EC and TDS with all the input parameters may be attributed to the fact that both outputs are directly related to the salt concentration of the water. Therefore, any change in the salt or ion concentrations in the water can directly affect the levels of both EC and TDS. The same trend was reported in many studies in which the concentration of dissolved solids and the conductivity were closely associated with ion and salt concentrations in water [17,20,50].

Cross-Validation Output
The cross-validation method was used in assessing the ANFIS models for EC and TDS. The output is graphically illustrated in Figure 15 using RMSE and R as assessment criteria. A deviation in the validation output can be observed for a single fold. Nevertheless, the results demonstrate a good mean accuracy over the 10 folds. The average RMSE values obtained by the EC and TDS models were 3.8 µS/cm and 4.2 ppm, respectively. The mean R values achieved by the EC and TDS models were 0.81 and 0.77, respectively. Over the 10 folds, minimum and maximum R values of 0.48 and 0.83, respectively, were obtained during the model validation for TDS. The lowest RMSE value, 2.56 ppm, was achieved for TDS in the second fold. The output of the cross-validation method demonstrated efficient performance and generalized results of the ANFIS models, indicating that the models can achieve good results on unseen data as well.

Variation of Model Output with Selected Inputs
As discussed previously (Section 3.2), input optimization was applied to select the optimum combination of input parameters. The optimization process revealed that Combination 1 (Ca²⁺, Na⁺, and Cl⁻) is the optimum combination for modeling EC, while Combination 2 (Mg²⁺, HCO₃⁻, and SO₄²⁻) is the most strongly correlated with TDS. Figures 13 and 14 show the 3D surface plots of the selected input combinations against the target outputs for EC and TDS, respectively. In Figure 13, EC shows an increasing, fluctuating trend as the input variables (Ca²⁺, Na⁺, and Cl⁻) vary. A similar result can be seen for TDS (Figure 14), which increases linearly with all the input variables. The increasing tendency of EC and TDS with all the selected input parameters may be attributed to the fact that both outputs are directly related to the salt concentration of the water. Therefore, any change in the salt or ion concentrations in the water directly affects the levels of both EC and TDS. The same trend was reported in many studies in which the concentration of dissolved solids and the conductivity were closely associated with ion and salt concentrations in water [17,20,50].
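An exhaustive search over input combinations, as referred to above, can be sketched as follows. This is only an illustration under stated assumptions: the scoring model is a stand-in linear least-squares fit rather than ANFIS, each combination is scored by RMSE on a hold-out split, and the data are synthetic, with the target constructed to depend on three of the candidates. The selected subset therefore follows from that construction and does not reproduce the study's result. All names (`exhaustive_search`, `demo_data`, and so on) are hypothetical.

```python
import itertools
import math
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def fit_linear(rows, targets):
    """Stand-in model (NOT ANFIS): least squares with intercept via normal equations."""
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    gram = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    moment = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(gram, moment)


def validation_rmse(coeffs, rows, targets):
    """RMSE of the fitted linear model on the given rows."""
    preds = [coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row)) for row in rows]
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets))


def exhaustive_search(data, target, names, subset_size=3, train_fraction=0.7):
    """Score every subset of `names` of the given size; return (best_subset, best_rmse)."""
    n_train = int(len(target) * train_fraction)
    best = (None, float("inf"))
    for subset in itertools.combinations(names, subset_size):
        rows = [[data[name][i] for name in subset] for i in range(len(target))]
        coeffs = fit_linear(rows[:n_train], target[:n_train])
        score = validation_rmse(coeffs, rows[n_train:], target[n_train:])
        if score < best[1]:
            best = (subset, score)
    return best


# Synthetic data for illustration only; the target depends on Ca, Na, and Cl by construction.
rng = random.Random(1)
candidate_names = ["Ca", "Mg", "Na", "Cl", "SO4", "HCO3", "pH"]
demo_data = {name: [rng.uniform(0.0, 10.0) for _ in range(300)] for name in candidate_names}
demo_target = [
    3.0 * demo_data["Ca"][i] + 2.0 * demo_data["Na"][i] + 4.0 * demo_data["Cl"][i]
    + rng.gauss(0.0, 0.5)
    for i in range(300)
]

best_subset, best_rmse = exhaustive_search(demo_data, demo_target, candidate_names)
```

With seven candidates and subsets of three, only 35 combinations need to be trained, which is why an exhaustive search remains affordable for a small pool of inputs.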

Discussion
The focus of the present study, namely data pre-processing, input optimization, and optimum ANFIS model development, has been presented in the previous sections. Most published work on modeling surface water quality parameters has used standalone ANN, GEP, SVM, DT, RF, and regression-based models. Modeling and predicting water quality parameters with such classical AI techniques alone does not always provide the desired outcomes. Therefore, it is essential to combine modeling methods with optimization algorithms to obtain effective and precise outputs.
A comparison of the current study with previous studies that applied ANFIS modeling and optimization techniques shows consistent findings. Al-Mukhtar et al. (2019) [20] reported that ANFIS performed better than ANN and regression models in predicting TDS and EC. Moreover, Azad et al. (2019) [5] reported better results for the ANFIS model combined with the particle swarm optimization (PSO) algorithm in predicting various water quality parameters. Khadr et al. (2017) [51] and Tiwari et al. (2018) [52] concluded that the ANFIS model is efficient in forecasting phosphorus, nitrogen, and other water quality parameters. Furthermore, Azad et al. (2018) [21] reported that the proficiency of the ANFIS model in modeling water quality parameters could be improved with optimization algorithms. Sun et al. (2019) [53] used the variational mode decomposition (VMD) and least squares support vector machine (LSSVR) methods for outlier detection and correction in water quality data. The authors reported that these methods performed accurately and improved the quality of the data. Alameddine et al. (2010) [54] compared the performance of three outlier detection approaches, namely the minimum covariance determinant (MCD), the minimum volume ellipsoid (MVE), and M-estimators, in detecting and removing outliers from lake water quality data. The results of that study showed that M-estimators are a robust and flexible method for dealing with inconsistent water quality data.
The available literature shows that only a limited number of studies have combined data pre-processing and ANFIS modeling with an exhaustive search for inputs, an integration that was successfully implemented in this study. Consistent water quality datasets, optimized modeling inputs, and a computationally efficient ANFIS structure can be achieved by adopting the methods used in this study. Moreover, integrated optimization algorithms are more effective in providing robust models with enhanced outputs than standalone ANN, SVM, GEP, RF, and other regression models.

Conclusions
In developing countries, financial constraints and a lack of facilities and infrastructure call for further research to develop accurate and computationally efficient models that require a minimum number of parameters for surface water quality prediction. The current study reported the development and application of the ANFIS modeling technique for predicting surface water quality, namely EC and TDS, in one of the major rivers in Asia, the upper Indus River Basin. The dataset comprised Ca²⁺, Mg²⁺, Na⁺, Cl⁻, SO₄²⁻, HCO₃⁻, pH, EC, and TDS, collected monthly over a period of 30 years (1975–2005). The specific outputs of this study are as follows:

• The two-sided outlier detection approach was found to be efficient in data pre-processing and outlier removal, yielding homogeneous and consistent data records for modeling.

• The input optimization process reduced the modeling complexity by identifying the optimum number of inputs, which helps reduce data collection and processing efforts and highlights the ability of the exhaustive search method to reduce noise in the data.

• The developed ANFIS model showed strong agreement with the actual data for both the training and the testing data. The ANFIS model was able to model the quality of surface water efficiently using the selected inputs. This could be attributed to the structure of the ANFIS model, which combines the advantages of fuzzy reasoning with the self-learning capability of neural networks.

• In conclusion, the ANFIS model can be used efficiently in water quality assessment and mitigation studies.
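To make the fuzzy-reasoning part of the ANFIS structure concrete, the following minimal sketch computes normalized rule firing strengths for a grid-partitioned rule base with triangular and Gaussian membership functions. It rests on assumptions that are not stated in this section: three membership functions per input (so that three inputs give 3 × 3 × 3 = 27 rules), inputs normalized to [0, 1], a product t-norm, and hypothetical membership parameters. The trainable consequent layer and the hybrid learning step are omitted, and all names are illustrative.

```python
import itertools
import math


def triangular_mf(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def gaussian_mf(x, center, sigma):
    """Gaussian membership centered at `center` with spread `sigma`."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def normalized_firing_strengths(inputs, membership_sets):
    """Fuzzify each input, combine memberships per rule by product, then normalize."""
    degrees = [[mf(x) for mf in mfs] for x, mfs in zip(inputs, membership_sets)]
    # One rule per combination of membership functions across the inputs.
    strengths = [math.prod(combo) for combo in itertools.product(*degrees)]
    total = sum(strengths)
    return [s / total for s in strengths]


# Hypothetical parameters: three membership functions per input on [0, 1].
centers = [0.0, 0.5, 1.0]
tri_set = [lambda x, c=c: triangular_mf(x, c - 0.5, c, c + 0.5) for c in centers]
gauss_set = [lambda x, c=c: gaussian_mf(x, c, 0.2) for c in centers]

sample = [0.2, 0.7, 0.5]  # hypothetical normalized values of three inputs
tri_weights = normalized_firing_strengths(sample, [tri_set] * 3)
gauss_weights = normalized_firing_strengths(sample, [gauss_set] * 3)
```

In this sketch only a few of the 27 rules fire for a given sample with triangular functions, because each triangle is zero outside its support, whereas every rule receives a nonzero weight with Gaussian functions.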