Forecasting Water Quality Index in Groundwater Using Artificial Neural Network

Groundwater quality monitoring in the vicinity of drilling sites is crucial for the protection of water resources. In this study, selected physicochemical parameters were determined in water collected from 19 wells located close to a shale gas extraction site, and the water quality index (WQI) was calculated from the obtained parameters. A secondary objective of the study was to test the capacity of artificial neural network (ANN) methods to model the water quality index in groundwater. The number of ANN input parameters was first limited to seven, selected using a multiple regression model. Subsequently, using the stepwise regression method, models with progressively fewer variables were tested. The best results were obtained for a network with five input neurons (electrical conductivity, pH, and calcium, magnesium and potassium ions) and five neurons in the hidden layer. The results showed that these parameters allow the water quality index to be modeled conveniently and with satisfactory accuracy. Artificial neural network methods exhibited the capacity to predict the water quality index at the desired level of accuracy (RMSE = 0.651258, R = 0.9992 and R² = 0.9984). Neural network models can thus be used to directly predict the quality of groundwater, particularly in industrial areas. The proposed method, using advanced artificial intelligence, can aid in water treatment and management. The novelty of this study lies in the use of an ANN to forecast groundwater WQI in an area of eastern Poland (Lublin) that had not been studied previously.


Introduction
Groundwater is one of the key water supply sources for people worldwide. Its quality is an issue of grave importance, as it is directly related to human health [1]. Consumption of contaminated water leads to health problems, as well as increased morbidity and mortality [2].
Industrial activities present a dire threat to the quality of groundwater [3]. The accelerated technological development of our civilization, which is fueled by the energy from crude oil and gas, has contributed to significant pollution of waters with oil derivatives among others [4].
Water quality indices facilitate water quality assessment. In some countries, they have been introduced into legal acts and are often used by the authorities supervising water quality [5]. One such indicator is the water quality index (WQI), an established tool and one of the most effective for assessing water quality [6]. It is a stable, accurate measure that informs about water quality [7]. No single parameter that characterizes the chemical, physical or biological properties of water can adequately express the WQI. WQI is typically assessed by measuring a wide range of parameters (e.g., temperature, pH, electrical conductivity (EC), turbidity, organic matter, metals). The advantage that WQI has over other assessment methods is that it determines the overall state of water quality without requiring an individual interpretation of particular parameters.
Prediction of the water quality in rivers [17] and lakes [18], as well as the quality of seawater [19] and drinking water [20], has been investigated by numerous researchers. However, no research on groundwater WQI prediction in drilling areas in Poland has been conducted so far; this study addresses that research gap for the first time.
The aim of the research was to optimize the ANN regarding the selection of input variables for the prediction of groundwater quality index in mining areas. The first step of the research was to determine the variables that have a significant impact on the WQI values. For this purpose, it was the multiple regression model that provided the basis for subsequent analyses, as well as enabled determining key variables and network validation.
The main research contributions are as follows:
1. assessment of the groundwater quality in the vicinity of the shale gas drilling rig by calculating the WQI;
2. demonstration of the computational prowess of ANNs to create the models capable of effectively predicting WQI;
3. defining the general framework of the ANN model (selection of the network type; selection of appropriate input data from multiple regression; selection of the number of hidden neurons; specification of the optimal setting of network learning parameters);
4. optimizing the number of parameters and the quality of the WQI prediction model;
5. creating a neural network model that can be used to directly predict the groundwater quality status and, thus, provide an alternative to the existing WQI calculation methods.

Research Object
The analyzed water samples were taken from 17 dug wells located in the village of Syczyn and two deepwater wells, which constitute drinking water intakes for the Wierzbica commune in the Lubelskie Voivodeship in Poland ( Figure 1). The sampling area is flattened and its height is in the range of 178-181 m above sea level. The study covered a rural area where agricultural activities are carried out. The dug wells are located in the immediate vicinity of the Natural Gas Processing Plant, where the shale gas well is located.
The closest of the wells is 220 m and the farthest is 600 m away from the plant site. The two deepwater wells are located 4.9 km and 6.75 km away on the opposite sides of the gas drilling rig. The groundwater samples were stored in a laboratory refrigerator at 4 °C until subjected to the standard analysis procedures in accordance with the American Public Health Association [21]. Table 1 specifies the methods and standards followed in the water quality parameters testing. The methodology for the development of the WQI, MLR and ANN models is shown in the flowchart (Figure 2).

Water Quality Index
The Water Quality Index (WQI) is an assessment technique that reflects the complex impact of individual water quality parameters on the overall quality of water [22]. Typically, 10 important water quality parameters are selected for the calculation of WQI, according to the drinking water quality standard recommended by the World Health Organization [23].
In this study, the physicochemical parameters in the three-step analysis of WQI were pH, EC, TDS, TH, Ca²⁺, Mg²⁺, Na⁺, K⁺, Cl⁻, HCO₃⁻, SO₄²⁻, NO₃⁻ and PO₄³⁻. In the first step, weights ($w_i$) were assigned to each parameter on a 1-5 scale, reflecting their relative importance for drinking water quality. The weights for individual water parameters are presented in Table 2 [24]. In the second step, relative weights ($W_i$) were calculated from Equation (1):
$$W_i = \frac{w_i}{\sum_{j=1}^{n} w_j} \qquad (1)$$
where $W_i$ is the relative weight, $w_i$ is the weight of each parameter, and $n$ is the number of parameters. In the third step, a quality rating ($q_i$) was assigned to each parameter from Formula (2), by dividing its concentration in the water samples by its respective standard as per the WHO guidelines [23] and expressing the result as a percentage:
$$q_i = \frac{C_i}{S_i} \times 100 \qquad (2)$$
where $q_i$ is the quality rating, $C_i$ is the concentration of each chemical parameter in each water sample in milligrams per liter, and $S_i$ is the drinking water standard for each chemical parameter in milligrams per liter. WQI was derived from Equations (3) and (4):
$$SI_i = W_i \times q_i \qquad (3)$$
$$WQI = \sum_{i=1}^{n} SI_i \qquad (4)$$
where $SI_i$ is the subindex of the i-th parameter, $q_i$ is the rating based on the concentration of the i-th parameter, and $n$ is the number of parameters. Table 3 presents the results from laboratory tests of physicochemical parameters of well water, whereas Table 4 shows the classification according to the range and type of WQI of the water in question.
Table 4. The range and type of water for WQI [25].
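The three-step calculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the three example parameters, and all weights, standards and concentrations are hypothetical values chosen only to show the arithmetic, and the quality rating is assumed to be expressed as a percentage (multiplied by 100), as is usual for this type of index.

```python
def water_quality_index(weights, concentrations, standards):
    """Weighted arithmetic WQI: the sum over parameters of W_i * q_i."""
    total_weight = sum(weights.values())
    wqi = 0.0
    for name, weight in weights.items():
        relative_weight = weight / total_weight                  # W_i
        # q_i, assumed here to be a percentage of the standard
        quality_rating = concentrations[name] / standards[name] * 100.0
        wqi += relative_weight * quality_rating                  # SI_i
    return wqi


# Hypothetical example values (not taken from the study's tables).
weights = {"pH": 4, "TDS": 5, "Cl": 3}
standards = {"pH": 8.5, "TDS": 500.0, "Cl": 250.0}
sample = {"pH": 7.65, "TDS": 400.0, "Cl": 50.0}

print(round(water_quality_index(weights, sample, standards), 2))
```

The same function can be applied to each well in turn by passing that well's measured concentrations.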

Selection of the ANN Input Parameters
Multiple regression was employed to establish the input parameters of the ANN model. The ANN was modeled in the R environment in the RStudio 1.0.153 program (RStudio, Boston, MA, USA). The quality of the models was verified by testing the normality of the residuals with the Shapiro-Wilk test. On the basis of diagnostic plots of the residuals against the predicted values, the homoscedasticity of the residuals and the absence of residual autocorrelation were checked. Cook's distance was used to check for outliers in the regression model. The multiple regression model allows examining the influence of a number of independent variables, i.e., water quality parameters ($X_1, X_2, \ldots, X_n$), on the dependent variable WQI ($Y$). The multiple regression model is given by Equation (5):
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n + \varepsilon \qquad (5)$$
where $\beta_j$ are the model parameters (regression coefficients) describing the influence of the j-th variable, and $\varepsilon$ is the random component (model residuals).
The measure of model fit is the multiple determination coefficient R², whose value lies in the range [0, 1], calculated from Formula (6):
$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2} \qquad (6)$$
where $\hat{y}_i$ is the value of the water quality index for the i-th observation obtained from the model, $y_i$ is the actual value of the water quality index for the i-th observation, $\bar{y}$ is the mean of the actual values, and $m$ is the number of observations. The coefficient compares the variance of the model residuals with the total variance of the data. The closer it is to one, the better the fit to the data. Conversely, the closer it is to zero, the worse the fit to the data [26].
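As a hedged illustration of the multiple regression model and its determination coefficient, the sketch below fits the coefficients by ordinary least squares and evaluates R² in its residual-variance form (an assumption about the exact form of the formula). The study itself used R; the function names and the small data set here are hypothetical and serve only to make the computation concrete.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bn] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]            # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(beta, X):
    """Evaluate b0 + b1*x1 + ... + bn*xn for every row of X."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]


def r_squared(y, y_hat):
    """Determination coefficient: 1 - residual variance / total variance."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical data generated from y = 2 + 3*x1 + 0.5*x2 (no noise).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [6.0, 8.5, 13.0, 15.5, 19.5]

beta = fit_ols(X, y)
print([round(b, 6) for b in beta], round(r_squared(y, predict(beta, X)), 6))
```

The normal equations are adequate for a small, well-conditioned example such as this one; for strongly correlated predictors a more stable solver would be preferable.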

Modeling of the ANN
WQI prediction with the application of ANNs was performed using the Neural Network library in the MATLAB and Simulink computing environments (MathWorks, Massachusetts, USA). The number of input parameters was derived from the multiple regression analysis. The input parameters of the model were the physicochemical parameters EC, pH, Ca, Mg, PO₄-P, K and SO₄²⁻ (7 input neurons), and the output neuron was WQI. Subsequently, using the stepwise regression method, models with progressively fewer variables were tested. Modeling was performed using the Neural Network Fitting app, with the Levenberg-Marquardt algorithm for training. A two-layer feed-forward network with sigmoid hidden neurons and linear output neurons was used. The networks included one hidden layer. According to the literature, networks with a higher number of hidden layers are better suited to more complex problems [27]. The number of neurons in the hidden layer (from 2 to 10) was selected experimentally, on the basis of the mean square error and the regression R-value. The training set comprised 70% of the data, and the test and validation sets 15% each. A schematic representation of the ANN for the first model is shown in Figure 3.
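To make the network layout and the data division more concrete, a minimal Python sketch is given below. It is not the MATLAB configuration used in the study: the function names are hypothetical, the hyperbolic tangent is assumed as the sigmoid-shaped hidden activation, the rounding rule for the 70/15/15 split is an assumption, and training with the Levenberg-Marquardt algorithm is not shown.

```python
import math
import random


def split_indices(n_samples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Randomly partition sample indices into training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(ratios[1] * n_samples)
    n_test = round(ratios[2] * n_samples)
    n_train = n_samples - n_val - n_test       # remainder goes to training
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer of tanh units followed by a single linear output neuron."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# 19 wells divided into training, validation and test subsets.
train_idx, val_idx, test_idx = split_indices(19)
print(len(train_idx), len(val_idx), len(test_idx))

# Untrained 7-5-1 network with all weights set to zero (placeholder values).
n_inputs, n_hidden = 7, 5
zero_hidden = [[0.0] * n_inputs for _ in range(n_hidden)]
output = forward([1.0] * n_inputs, zero_hidden, [0.0] * n_hidden,
                 [0.0] * n_hidden, 1.5)
```

With only 19 samples, the validation and test subsets contain three wells each under this rounding rule, which is worth keeping in mind when interpreting subset-level statistics.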

Two measures of network quality were employed: the mean square error (MSE) and the regression R-value. MSE was calculated from Equation (7):
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y'_i - y^*_i \right)^2 \qquad (7)$$
where $n$ is the number of cases in a given set, $y'_i$ is the actual value of the water quality index for the i-th observation, and $y^*_i$ is the predicted value of the water quality index for the i-th observation. The regression coefficient R measures the correlation between the network outputs and the reference (target) values and is derived from Equation (8):
$$R = \frac{\mathrm{cov}(y', y^*)}{\sigma_{y'} \, \sigma_{y^*}} \qquad (8)$$
where $\mathrm{cov}(y', y^*)$ is the covariance between the reference and predicted values, $\sigma_{y'}$ is the standard deviation of the reference values, and $\sigma_{y^*}$ is the standard deviation of the predicted values.
The higher the regression coefficient R and the lower the MSE, the better the quality of the generated network. In order to compare the quality of WQI predictions for individual models, the root mean square error (RMSE) was used, calculated from Equation (9):
$$RMSE = \sqrt{MSE} \qquad (9)$$
where MSE is the mean square error.
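The network quality measures discussed above (MSE, the regression R-value and RMSE) can be computed as in the following Python sketch. The function names and the four reference and predicted values are hypothetical and are included only to show how the measures relate to one another.

```python
import math


def mse(actual, predicted):
    """Mean square error between reference and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, i.e. the square root of MSE."""
    return math.sqrt(mse(actual, predicted))


def regression_r(actual, predicted):
    """Correlation coefficient: covariance divided by both standard deviations."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    return cov / (std_a * std_p)


# Hypothetical reference and predicted WQI values.
reference = [40.0, 55.0, 70.0, 85.0]
predicted = [41.0, 54.0, 71.0, 84.0]

print(mse(reference, predicted), rmse(reference, predicted),
      regression_r(reference, predicted))
```

Because RMSE is expressed in the same units as WQI, it is the more direct of the two error measures to compare against the WQI class boundaries.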

Water Quality Index
The parameters of groundwater samples from the studied wells (Table 2) were compared with the drinking water quality guidelines defined by the WHO [23]. The contents of nitrate (V), chloride, magnesium, sodium, and bicarbonate (IV) ions did not exceed the limits in any of the tested wells. On the other hand, total hardness was several times higher than the value of TH specified in the WHO guidelines. Water pH was elevated exclusively in well 15, and the concentration of sulfate (VI) in wells 18 and 19. The WHO standards were exceeded in the majority of wells for the following parameters: calcium concentration (except for wells 24 and 27, where it was below 75 mg/L); potassium concentration (except for wells 16, 21-22 and 28, where it was below 10 mg/L); total dissolved solids (TDS) (except for wells 19, 21-26, 28 and 30, where it was below 500 mg/L); and the phosphate (V) value (except for wells 16, 18, 22-24 and 30).
The calculated WQI is presented in Table 5. Wells 10, 15, 22, 24, 25 and 28 are characterized by WQI < 50, which denotes water of excellent quality. The WQI value of the remaining wells was between 50.30 and 84.93, indicative of good water quality. None of the examined wells showed poor water quality. In the first step, the correlation matrix between the variables was analyzed. Because the matrix computed for the full set of variables was singular, the number of variables was reduced.

Multiple Regression Model
Given the very high correlation coefficients between EC and TDS (1.0) and between TH and Ca (0.99), the variables TDS and TH were removed. The remaining correlation coefficients between the independent variables were lower than 0.85 in absolute value, which did not indicate an excessively strong correlation among the independent variables in the regression model.
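The screening of strongly correlated predictors can be sketched in Python as follows. This is an illustrative assumption about how such a check might be coded, not the procedure run in R by the authors: the function names are hypothetical, the 0.85 threshold is taken from the text above, and the four-sample data set is invented so that EC and TDS are perfectly correlated.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.85):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical measurements; TDS is an exact multiple of EC here.
columns = {
    "EC": [400.0, 650.0, 900.0, 1200.0],
    "TDS": [256.0, 416.0, 576.0, 768.0],
    "pH": [7.4, 7.1, 7.6, 7.2],
}

flagged = highly_correlated_pairs(columns)
print([(a, b) for a, b, _ in flagged])
```

For each flagged pair, one of the two variables would then be dropped before fitting the regression model, as was done for TDS and TH.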
After the TDS and TH variables had been removed, Cl⁻, SO₄²⁻, NO₃-N and Na proved insignificant in the regression model. Despite the high R² of the model (0.99), verification of the assumptions showed that the model with the complete set of variables contained one outlier (Map 30) and that its residuals were not normally distributed; for this reason, the model was rejected. The progressive (forward) stepwise regression produced a model that met the assumptions of a correct multiple regression: normal distribution of residuals, no residual autocorrelation, residual homoscedasticity, and no outliers. The coefficients of the model are shown in Table 6; SO₄²⁻ emerged as an insignificant variable, most likely owing to redundancy associated with its high correlation with other variables. The determination coefficient R² = 0.99 indicates a very good fit of the model to the data.
Positive values of the model coefficients indicate a positive relationship between the independent variables and the WQI score, while negative ones denote the contrary. The results show that all the coefficients for the independent variables are positive; hence, when the physicochemical parameters EC, pH, Ca, Mg, PO₄³⁻, K and SO₄²⁻ increase, the WQI value increases accordingly, thereby signifying an increase in water pollution.

ANN Modeling and WQI Prediction
In accordance with the results from the multiple regression model, the WQI variable was modeled by means of ANNs using EC, pH, Ca, Mg, PO₄³⁻, K and SO₄²⁻ as input variables. Among the tested networks, the best results were obtained for the network with five hidden neurons, after 14 iterations. Other data, including performance validation and the rate of error decrease (gradient), are shown in Figure 4. The ANN performance was validated with MSE. The best validation result was reached at the network's tenth iteration (Figure 5). In general, the error decreased over successive training iterations, but it would be expected to start increasing on the validation dataset once the network began to overfit the training data. Training was set to stop after six consecutive iterations in which the validation error increased (or did not decrease), and the final results were taken from the iteration with the lowest validation error. Table 7 shows the network training results (MSE and regression R-value) categorized into training, validation, and testing subsets.
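The stopping rule described above, in which training halts after six consecutive iterations without a decrease in validation error and the iteration with the lowest validation error is retained, can be expressed as a short Python sketch. The function name and the sequence of validation errors are hypothetical, iterations are counted from zero, and the code only mirrors the logic of the rule rather than the internals of the MATLAB training routine.

```python
def best_epoch_with_early_stopping(validation_errors, max_fail=6):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops once `max_fail` consecutive epochs have failed to
    improve on the lowest validation error seen so far.
    """
    best_error = float("inf")
    best_epoch = None
    fails = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch, fails = error, epoch, 0
        else:
            fails += 1
            if fails >= max_fail:
                return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1


# Hypothetical validation errors: the minimum occurs at epoch 3.
errors = [9.0, 5.0, 3.0, 2.0, 2.5, 2.4, 2.6, 2.2, 2.3, 2.1, 2.8]
result = best_epoch_with_early_stopping(errors)
print(result)
```

In this invented sequence the error never falls below the value reached at epoch 3, so the sixth consecutive failure occurs at epoch 9 and the weights from epoch 3 would be kept.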
Energies 2021, 14, x FOR PEER REVIEW 9 of 17
The regression value (R) is 0.9984 for the training data, 0.9984 for the validation data and 0.9786 for the test data; thus, in each case R > 0.95, which is typical of a well-fitted network. The overall regression coefficient was 0.9966, which confirms the high degree of overlap between the fit line through the measurement points and the ideal Y = T prediction line.
Subsequently, using the stepwise regression method, models with progressively fewer variables were tested. In the first step, models with one variable removed were created. The best-performing model, i.e., the one with the minimal error, was the model with the SO₄²⁻ variable rejected (Table 8). For comparison, the worst performance was delivered when Ca had been rejected (also Table 8). Having established that rejecting the SO₄²⁻ variable improves the quality of the model, another variable was removed from the model in question and the models were compared. This time, the best model was the one with the SO₄²⁻ and PO₄³⁻ variables rejected, where the mean square error is 0.651358, compared to 1.25755 for the full model. After the next variable was rejected, the mean square error started to increase, which showed that the quality of the model was deteriorating.
The optimization of the number of parameters and the quality of the network consisted in minimizing RMSE and the maximum error (%) relative to the WQI value. Table 8 shows two versions of the models for each step: models with one rejected variable were tested each time, and the models with the best and the worst parameters are presented. It can be seen that, up to a point, the models have lower RMSE values as variables are rejected. The best network was obtained with the following input parameters: EC, pH, Ca, Mg, K, for which RMSE = 0.651258 and the maximum error was 2.98%. The RMSE value obtained for this network was of a similar magnitude as in the case of the original multiple regression model (RMSE = 0.06167). Moreover, the good quality of this network is confirmed by R = 0.9992 and R² = 0.9984. After further variables were discarded, the RMSE values began to increase and the properties of the models deteriorated. Figure 6 presents a comparison of the experimental data and the data obtained from WQI prediction for the network with the optimal input parameters (EC, pH, Ca, Mg, K). Despite the small discrepancies (RMSE = 0.06167), it can be concluded that the presented ANN models display an acceptable level of error and, therefore, emerge as reliable predictors supporting decision-making processes [17].
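The variable-rejection procedure described in this section resembles a greedy backward elimination, which can be sketched in Python as shown below. This is only an assumed reading of the procedure: the function names are hypothetical, and `toy_score` is an invented stand-in for training a network and returning its error, so the numbers it produces are not results of the study.

```python
def backward_elimination(variables, score):
    """Greedily drop one input at a time while the error keeps decreasing."""
    current = list(variables)
    best_error = score(current)
    while len(current) > 1:
        # Evaluate every model obtained by rejecting exactly one variable.
        candidates = [[v for v in current if v != dropped] for dropped in current]
        errors = [score(c) for c in candidates]
        lowest = min(errors)
        if lowest >= best_error:
            break                      # any further rejection worsens the model
        best_error = lowest
        current = candidates[errors.index(lowest)]
    return current, best_error


def toy_score(subset):
    """Invented error function: penalizes missing useful and redundant inputs."""
    useful = {"EC", "pH", "Ca", "Mg", "K"}
    missing = len(useful - set(subset))
    redundant = len(set(subset) - useful)
    return 1.0 + 2.0 * missing + 0.5 * redundant


inputs = ["EC", "pH", "Ca", "Mg", "PO4", "K", "SO4"]
selected, error = backward_elimination(inputs, toy_score)
print(selected, error)
```

In practice the `score` callback would train a network on the given subset of inputs and return its RMSE on held-out data; since network training is stochastic, repeated runs per subset may be needed for a stable ranking.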

Discussion
The drilling processes constitute a real threat involving the pollution of groundwater [28]. In accordance with legal regulations, the assessment of water quality in drilling areas must be controlled and constantly monitored. In the first part of this paper, the assessment of water quality indicators (Table 4) was performed according to the criteria defined by Sahu and Sikdar [25]. The results indicated that in 11 wells the water quality is good, and in 8, it is excellent (WQI < 50).
In the paper, an attempt was made to prove that ANN models are efficient tools for predicting the water quality, which can be employed for planning integrated water protection systems, while implementing good environmental management practices.
There are several advantages of employing the ANN method for predicting WQI over the computational approach. The latter requires determining the values of at least twelve water quality parameters (Table 1), which must then be converted into partial indicators (Table 2), because the WQI formula operates on subindices. Therefore, it is much easier to apply an existing ANN model that uses raw data without the need for additional recalculation. As few as five water quality parameters are required to build the model, which, among other things, reduces the cost of water quality monitoring.
The water quality prediction models are largely focused on rivers. The research object in this study involved the wells located in Poland, in the vicinity of a shale gas drilling rig. Thus far, no ANN models of water quality in drilling areas have been created in Poland. The presented modeling method can be adjusted for application in other drilling areas (coal, kerosene, gas). The authors believe that the application of artificial neural networks in various aspects of the environment constitutes a new approach to implementing good environmental management practices.
On the basis of results obtained via empirical studies on 19 wells located in the vicinity of a shale gas drilling rig, an ANN model of WQI was created for the water intakes located in shale gas drilling areas. Table 9 presents the results of the existing WQI models, including the one created by the authors.
Abbreviations used in Table 9: EC: electrical conductivity; pH: pondus hydrogenii; Ca: calcium; Mg: magnesium; K: potassium; TDS: total dissolved solids; N-NO₂ + N-NO₃: sum of nitrite and nitrate; Cl⁻: chloride; SO₄²⁻: sulphate; CO₃²⁻: carbonate; HCO₃⁻: bicarbonate; F⁻: fluorides; TH: total hardness; SAR: sodium adsorption ratio; RSC: residual sodium carbonate; WT: water temperature; OS: oxygen saturation; NTU: turbidity; N-NO₃: nitrate; P-PO₄: phosphate; BOD: biochemical oxygen demand; COD: chemical oxygen demand; B: boron; SSP: soluble sodium percent; DO: dissolved oxygen; N-NH₃: ammonium; SS: sodium soluble percentage; Fe: iron; Mn: manganese; Cu: copper; Cr: chromium; Zn: zinc; As: arsenic.
The set of input variables in the model for WQI assessment can be composed in various ways. According to Nordin et al. (2021), the most common variable used for ANN modeling of the groundwater quality is SO₄²⁻, followed by Cl⁻, Mg, Ca, pH, TDS, and TH [19]. The analysis indicated that the above-mentioned parameters are strongly correlated with each other and were frequently selected as input parameters for the groundwater quality evaluation [40]. These input parameters accurately reflect the groundwater quality, both for drinking and for irrigation. The correlation analysis conducted in this paper likewise indicated a strong correlation between the selected input data.
The presented model structure was optimized in order to minimize the number of input variables to the model while not compromising the quality of predictions. The input factors for building the neural network model were selected from the results of multiple regression. ANN validation with the use of linear regression analysis was proposed, e.g., by Flavelle (1992) [41]. Schenker and Agarwal (1996) proved that the comparison of regression models should be based on the closest adjustment of the R-value [42]. The determination coefficient of the developed model reached R² = 0.99, which indicates a very good fit of the model to the data. The multiple regression model showed that EC, pH, Ca, Mg, PO₄³⁻, K and SO₄²⁻ are significant variables in the prediction of WQI.
On the basis of the regression model, optimal input data for ANN model creation were selected. On the basis of the network quality indicators (R, R², and MSE), a network with five neurons in the hidden layer was selected as the best performer.
In order to develop a well-fitted and efficient ANN model for WQI prediction, it is necessary to select appropriately, among other things:
• the number of neurons in the hidden layer
• the dataset division into training, validation, and test sets
• the input variables.
In the literature, the dataset is divided into training, testing, and validation data in various ways. Jian et al. (2020) created an optimal WQI prediction model for a river using a 70% training set with 15% test and 15% validation sets [34]; Nayayk et al. (2020) obtained an optimal WQI prediction model for the Godavari river using training, validation, and testing sets of 75%, 15%, and 10%, respectively [32]. In turn, Al-Adhaileh and Alsaade created a WQI prediction model for selected rivers and lakes of India, using 70% of the data for the training phase and 30% for the testing phase [30].
While creating the investigated model, the authors adopted the training, validation, and testing data ratio used by Jian et al. [34].
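As an illustration of the 70%:15%:15% division, the following minimal Python sketch shuffles sample indices and cuts them into three subsets. The function name `split_indices`, the rounding rule, and the fixed random seed are assumptions made for this example; the paper does not state how many of the 19 wells fell into each subset or how the random division was seeded.

```python
import random

def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into training/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle (seed is an assumption)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# 19 samples, matching the number of wells in the study.
train_idx, val_idx, test_idx = split_indices(19)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 19 samples and this rounding rule, the subsets hold 13, 3, and 3 samples, so each validation or test error rests on very few wells.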
The number of neurons in the hidden layer is another parameter that determines how well the model fits. The optimal number of neurons is determined by trial and error. It is expected that the number of hidden neurons should range from 11 to 11 [43]. In the presented study, the model comprised five hidden neurons. Patki et al. (2021) developed an ANN model for WQI prediction in municipal distribution using the same number of neurons as in the model developed by the authors of this paper [30]. In turn, Yilma et al. (2018) obtained the optimal model architecture using 15 hidden neurons, which resulted in an R² value of 0.93 [36].
The authors focused on selecting the smallest possible number of input variables. Table 9 shows that the number of input variables used for predicting water quality differs between studies, ranging from 6 [28] to 23 [39]. Moreover, the input data are rarely optimized for network creation, as in the case of the ANN model presented by Hore [14].
The models with a decreasing number of variables were tested using the backward stepwise regression method. First, the models with one variable removed were constructed to select the one with the minimum RMSE and the minimum percentage error in relation to the WQI value. In subsequent steps, other variables were removed until the WQI values began to increase in two subsequent iterations. By minimizing the RMSE and the maximum % error in relation to the WQI value, the optimal set of network inputs (EC, pH, Ca, Mg, K) was determined, under the constraint of maintaining the appropriate quality of the network (RMSE = 0.651258, maximum % error = 2.98%). It was possible to create a model using only five types of input data, which showed R = 0.9992 and R² = 0.9984, with an RMSE of 0.0292 for the training set, 0.2131 for the test set, and 1.3244 for the validation set. When the fit of the model to the actual data was analyzed (by comparing R and R²), the network showed superior quality in comparison to the other presented models, even though fewer input parameters were used (compared to other studies; see Table 9).
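The backward elimination procedure described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the stopping rule (stop once the error has risen in two consecutive steps) is one reading of the description, and `toy_error` is a hypothetical stand-in for training a network on a variable subset and measuring its RMSE. The toy function is deliberately constructed so that the example ends on the five variables reported in the paper; it says nothing about the real data.

```python
def backward_elimination(variables, error_fn, patience=2):
    """Greedily drop one variable per step; stop after `patience` consecutive
    error increases and return the best subset seen."""
    current = list(variables)
    best_subset, best_error = list(current), error_fn(current)
    previous_error = best_error
    rises = 0
    while len(current) > 1 and rises < patience:
        # Build every subset with exactly one variable removed.
        candidates = [[v for v in current if v != dropped] for dropped in current]
        # Keep the candidate with the lowest error.
        current = min(candidates, key=error_fn)
        error = error_fn(current)
        rises = rises + 1 if error > previous_error else 0
        previous_error = error
        if error <= best_error:
            best_subset, best_error = list(current), error
    return best_subset, best_error

# Hypothetical error function (assumption): a large penalty when one of five
# chosen variables is missing, plus a small penalty per input variable.
KEY_VARIABLES = {"EC", "pH", "Ca", "Mg", "K"}

def toy_error(subset):
    missing = len(KEY_VARIABLES - set(subset))
    return 1.0 * missing + 0.01 * len(subset)

start = ["EC", "pH", "Ca", "Mg", "PO4", "K", "SO4"]  # seven regression-selected inputs
best_subset, best_error = backward_elimination(start, toy_error)
print(best_subset, best_error)
```

In a real application, `error_fn` would retrain the network for each candidate subset, which makes the search costly but still tractable for seven starting variables.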
The WQI predictions computed using the 5-5-1 network trained by the Levenberg-Marquardt algorithm are strongly and positively correlated with the WQI values of interest (r = 0.999, p < 0.05). This is further confirmed in Figure 6, which compares the ANN-predicted WQI with the WQI values calculated on the basis of real data. On this basis, it can be concluded that the overall agreement between the real and simulated data is satisfactory. In turn, in the studies by Gazz et al. (2012), the relationship between the measured and simulated WQI of rivers (R = 0.977, p < 0.01; R² = 95.4%; RMSE = 1.663) was obtained using the ANN model.
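The goodness-of-fit indicators quoted in this discussion (R, R², and RMSE) can be computed as in the minimal Python sketch below. The sample values are hypothetical and are not taken from the study. The sketch assumes that R² is the square of the Pearson correlation coefficient R, which is consistent with the reported figures (0.9992² ≈ 0.9984).

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(observed, predicted):
    """Pearson correlation coefficient R between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

# Hypothetical WQI values for illustration only (NOT data from the study).
wqi_calculated = [42.0, 55.5, 61.2, 70.8, 88.1]
wqi_predicted = [41.6, 56.0, 60.9, 71.5, 87.7]

r = pearson_r(wqi_calculated, wqi_predicted)
error = rmse(wqi_calculated, wqi_predicted)
print(f"R = {r:.4f}, R2 = {r ** 2:.4f}, RMSE = {error:.4f}")
```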
Koudari et al. (2021) created a WQI ANN model for a region in south-east Algeria, characterized by an R² value of 0.997 [43].
While comparing the created model with others (Table 9: values of R, R², and RMSE), it can be stated that the general goodness of fit between the measured and simulated data is satisfactory.
Artificial neural network models should be used more frequently for predicting the quality of waters, as well as in other aspects of environmental protection.
The use of neural network models is possible even in the case of data deficiency, e.g., due to the unavailability, difficulty, or high cost of obtaining the actual figures. It also allows for the reconstruction of missing data, reducing costs and saving time for otherwise indispensable analyses, which can be a strong economic factor in science.
The obtained findings and drawn conclusions should encourage authorities and water quality management bodies to invest in the use of neural network modeling as a comprehensive and high-quality alternative to the standard WQI calculation methods.

Conclusions
The paper presented an alternative to an analytical method of water quality index determination, i.e., artificial neural network modeling for WQI prediction. The data used for conducting the study involved the results of empirical studies from 19 wells located in the vicinity of a shale gas drilling rig in Poland.
The ANN model was built as a two-layer feedforward network with sigmoid hidden neurons and linear output neurons. The model was created using only five input parameters: EC, pH, Ca, Mg, and K. Optimal input data for ANN modeling were selected based on the results of multiple regression.
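The forward pass of such a 5-5-1 network can be sketched in Python as follows. This is an illustrative sketch only: the weights are random placeholders rather than the trained values, the input vector is hypothetical, and the logistic function is assumed as the sigmoid (the paper says only "sigmoid"; a tan-sigmoid is another common choice).

```python
import math
import random

def logistic(x):
    """Logistic sigmoid, assumed here as the hidden-layer activation."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: sigmoid hidden layer followed by one linear output neuron."""
    hidden = [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

n_in, n_hidden = 5, 5  # inputs: EC, pH, Ca, Mg, K; five hidden neurons
rng = random.Random(0)  # placeholder weights, NOT the trained network
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = rng.uniform(-1, 1)

# Hidden weights + hidden biases + output weights + output bias.
n_params = n_hidden * n_in + n_hidden + n_hidden + 1

sample = [0.2, 0.5, 0.1, 0.3, 0.4]  # hypothetical scaled inputs
wqi_hat = forward(sample, w_hidden, b_hidden, w_out, b_out)
print(n_params, wqi_hat)
```

A 5-5-1 network of this form has 36 adjustable parameters (25 hidden weights, 5 hidden biases, 5 output weights, and 1 output bias).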
The ANN model was characterized by a correlation coefficient of 0.9992, as well as MSE equal to 0.0292, 0.2131, and 1.3244 for the training, testing, and validation sets, respectively. These results were obtained with the following set division: training (70%), validation (15%), and testing (15%), using the Neural Network Fitting app and the Levenberg-Marquardt algorithm for training.
The advantage of the ANN model proposed in this paper is the easy assessment of groundwater pollution levels; in addition, it avoids the lengthy calculations involved in the conventional WQI method.
In future works, the authors will attempt to effectively forecast water quality, using indicators that would depend on the proximity of pollution sources.
In future works, the authors are planning to employ other machine learning methods for groundwater quality assessment, including deep neural networks, Logistic Regression (LogR), Random Forest (RF), Decision Tree (DT), and K Nearest Neighbor (KNN). The results obtained using these methods will be compared and their impact on prediction quality will be investigated.