Use of Factor Analysis (FA), Artificial Neural Networks (ANNs), and Multiple Linear Regression (MLR) for Electrical Conductivity Prediction in Aquifers in the Gallikos River Basin, Northern Greece

Because human activities and growing demand have degraded water resources over the last few decades, management practices and policies must be optimized, which in turn requires more reliable data. Cost and time are always important; therefore, methods that can provide low-cost data in a short period of time have been developed. In this study, the ability of an artificial neural network (ANN) and a multiple linear regression (MLR) model to predict the electrical conductivity of groundwater samples in the Gallikos River basin, northern Greece, was examined. A total of 233 samples were collected over the years 2004–2005 from 89 sampling points. Descriptive statistics, the Pearson correlation matrix, and factor analysis were applied to select the input water quality parameters. The inputs to the ANN and MLR were Ca, Mg, Na, and Cl. The best ANN results were provided by a model that included one hidden layer of three neurons. The mean absolute percentage error, modeling efficiency, and root mean square error were used to evaluate the performances of the methods and to compare the prediction capabilities of the ANN and MLR. We concluded that the ANN and MLR models were valid and had similar accuracy (using the same inputs) with a large number of samples, but in the case of a smaller data set, the MLR showed a better performance.


Introduction
The continuous increase in the global population over the last few decades and the improvement in human well-being in developed countries have led to increased demands for food production. According to Cay and Uyan [1], the total irrigated land area worldwide increased by more than five times from 1900 to 2000. The increase in agricultural and industrial production resulted in the increased introduction of chemical compounds into water resources [2,3]. Nowadays, nitrates and pesticides are among the most common pollutants of drinking and irrigation water resources [4]. In many areas of the world, water scarcity is mainly a matter of quality rather than quantity. The supply of drinking water is a priority for modern societies [5]. Therefore, the optimal management of this resource is important for meeting increasing demands. The key elements in effective management are the assessment of water quality, the identification of pollutants and their sources, and the monitoring of the pollutants' fluctuations over time. Water quality is determined by assessing biological, chemical, and physical parameter values [6,7]. Assessment relies on standards developed by the competent authorities of each country, which set maximum permissible concentrations of certain chemicals in water. Groundwater is one of the most important natural resources for drinking and irrigation purposes [8] and supports the socio-economic development of countries. The advantages of groundwater compared to surface water are its higher quality, lower rate of evapotranspiration, and lower vulnerability to contamination [9,10]. Globally, agriculture is the main consumer of groundwater [11]. Conventional investigations of groundwater quality are mainly based on data and measurements performed in the field and on analysis of groundwater sample parameters carried out in the laboratory. The selection of parameters to be monitored depends on the objectives of the study and the available funding [12].
Despite the detailed planning and design of the sampling procedure, there are often restrictions regarding availability of time, sampling point accessibility, and lack of funds. Therefore, evaluation of water quality using conventional methods incurs economic costs and reduces the decision-making capacity and effectiveness of management programs [13]. In order to overcome such data-scarcity problems, researchers have increasingly used descriptive statistical analysis, multivariate statistical analysis, and artificial neural networks for the evaluation of hydrochemical data in the field of hydrogeology. Multivariate statistics can be used to identify the hydrochemical and hydrogeological processes that determine groundwater quality characteristics [14,15] and to distinguish anthropogenic from geological impacts on groundwater composition [16].
In the last few decades, artificial neural networks (ANNs) have been widely applied in the area of water quality modeling. They are considered a prediction tool and have been used in various fields such as flood prediction [17,18], land use [19], and water quality [20], or to predict parameter values such as electrical conductivity and total dissolved solids based on measurements of other variables [21][22][23][24][25][26]. They have also been used in hydrogeology to determine aquifer parameters [27][28][29], evaluate the qualitative characteristics of groundwater [30], and predict groundwater level [31][32][33][34]. ANNs are information processing systems consisting of nonlinear interconnected processing elements called neurons [35].
Regression models are best for establishing an association between dependent and independent variables, and they are considered the simplest and most straightforward form of model. They are based on the method of least squares and are usually considered for the first stage of an investigation of the relationship among variables.
The electrical conductivity value is an index of salinity, and it is often used as an indicator of water quality for agricultural, industrial, or domestic demands. It can highlight the changes over time and space in an aquifer and is usually measured in situ, but the measurement methods are usually time-consuming [36].
The aim of this paper was to use an ANN and multiple linear regression (MLR) to predict electrical conductivity (EC), the dependent variable, from independent variables. EC was selected as the most appropriate indicator of water quality in this paper since the Gallikos River basin (northern Greece) is an area subject to intense agricultural activities, and it has a complex geological structure.
Descriptive statistics and multivariate statistics were employed to identify the main hydrogeological and hydrochemical processes of the area. These methods were used to reveal the hidden relationships among variables and determine the parameters that were used as inputs in the ANN and MLR models. Finally, the prediction abilities of the MLR were compared with those of the ANN models. To check the prediction accuracy and select the best predictive model, the coefficient of determination (R²), mean absolute percentage error (MAPE, %), root mean squared error (RMSE), and modeling efficiency (EF) were used. To the best of our knowledge, there have been no prior studies in the Gallikos River basin implementing prediction of EC using ANN and MLR models. This study introduces the coupling of different statistical methods, along with prediction tools, establishing a methodology that could be applied in the same area for the prediction of other parameters or in any other area of the world. This paper is structured as follows. Section 2 describes the characteristics of the study area. Section 3 explains the methodology and the theoretical background of the models. Section 4 illustrates and explains the results. Section 5 concludes the paper.

Study Area
The study area was the Gallikos River basin (868 km²) in northern Greece (Figure 1). According to Mattas [37], 90% of the total area lies below an altitude of 600 m, while the mean altitude is 357.7 m. The length of the river within the boundaries of the study area is approximately 48 km. The mean annual precipitation over the basin is approximately 480 mm [37]. According to the Hellenic National Meteorological Service, the climate of the studied area is cold semi-arid (BSk) [38]. The Gallikos basin belongs to the Serbo-Macedonian massif, Circum-Rhodope belt, and the zone of Peonia [39][40][41][42]. A vast area is filled with Quaternary fluvio-lacustrine sediments, due to the existence of the river, and Tertiary formations consisting of marls. The bedrock of the basin is formed from argillaceous schists, carbonate rocks (from limestones to dolomites), quartzites, amphibolites, and gneisses. In the study area, there are no significant surface storage constructions, and the majority of the irrigation demands are covered by groundwater. The main cultivations in the area are corn, tobacco, cotton, sunflower, cereals for forage, trees (mainly almonds and oil-producing olives), and vegetables [43]. Approximately 77% of pumped groundwater is used for irrigation according to the approved River Basin Management Plan-River Basin District GR10-Central Macedonia [44]. Two main aquifer systems have developed in the area. A granular system is formed in the sediments of the basin, and a fractured aquifer system exists in the crystalline rocks of the northeast part. There are two karstic aquifers that are smaller but are of great importance, since they provide good quality water for the drinking demands of the residents in the wider area [44].
Increased concentrations of nitrates (>50 mg/L), sodium (>200 mg/L), and chlorides (>250 mg/L) have been recorded in groundwater samples and are attributed to anthropogenic activities related to the agricultural and industrial sectors [45][46][47][48].

Water Sampling and Analysis
Hydrochemical data from 233 groundwater samples from 89 sampling points were examined and utilized for statistical treatment, multivariate statistics, multiple linear regression, and artificial neural network models. The IBM SPSS Statistics 25 software was used. Samples were collected over four different sampling periods (wet and dry periods in the years 2004 and 2005). In addition, 18 samples were collected from selected wells in the wet period of 2006. These samples were used as the verification dataset, to check the reliability of the MLR and ANN models. The sampling points had an adequate spatial distribution. In situ measurements of pH and EC were carried out, and the water samples were filtered through 0.45 μm membrane filters. Each sample was refrigerated at 4 °C in the laboratory. Extra samples were collected and acidified to pH < 2 using HCl. All analyses were conducted according to the Standard Methods for the Examination of Water and Wastewater [49] at the laboratory of the Land Reclamation Department of the Soil and Water Resources Institute, which is accredited based on ELOT EN ISO/IEC 17025.

Multivariate Statistical Analysis
Multivariate statistical techniques can identify the factors that determine groundwater quality and are considered a reliable tool for finding pollutant sources and distinguishing anthropogenic or geogenic origins [50][51][52]. Multivariate statistical techniques, such as factor analysis, are widely employed in environmental studies [53,54]. One of the techniques commonly applied to identify the relationship between water quality parameters is the factor analysis method. In the present study, R-type factor analysis was performed. Selection of the input parameters for successful forecasting using an artificial neural network is crucial. The factor analysis outcomes were used to select the most suitable variables for the implementation of the ANN model.

Artificial Neural Networks
ANNs are used as a supplement to conventional statistics; their ultimate objective is to process and store experimental knowledge and convert it into a form that is useful to the user [55][56][57]. A typical ANN consists of artificial processing elements, called neurons or nodes, which interact with each other through synapses (see Figure 2).
The neurons are grouped in layers, and information is encoded during the process of training and learning. This structure is a widely used model in hydrogeology applications with the ability to recognize patterns among parameters. The most efficient transfer functions are the sigmoid logistic function and the hyperbolic tangent function, which are implemented in most ANN models [58]. Supervised training is based on an "external teacher" that provides the target value for each training phase. The model learns to adjust the synaptic weights, taking into consideration the targets. The objective is to minimize the error by searching for the optimal weights [59]. Standard statistical criteria used to evaluate an ANN's performance are the mean squared error (MSE), which compares the predicted output with the desired output, and the coefficient of determination (R²). In the present study, a feed-forward, supervised, back-propagation learning algorithm ANN model was used for predicting the EC of groundwater, using data from the years 2004–2005, in the Gallikos River basin. The ANN model consisted of one input layer with four elements (i.e., Ca, Mg, Na, and Cl), one hidden layer including three nodes, and the output layer where the EC value was calculated. Of the total sample, 80% was used for training and 20% for testing. The specific artificial neural network structure was selected because it showed the best performance after using the "trial and error" method by modifying the input parameters (number of hidden neurons, number of nodes, percentages of the training-testing sample sets, etc.).
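The network topology and data split described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the SPSS model used in the study: the hyperbolic tangent hidden activation, the linear output node, the weight initialization range, and all function names are assumptions made for the example, and the back-propagation training loop is omitted.

```python
import math
import random

# Topology from the text: 4 inputs (Ca, Mg, Na, Cl) -> 3 hidden nodes -> 1 output (EC)
N_INPUTS, N_HIDDEN = 4, 3


def init_network(seed=0):
    """Create randomly initialized weights (the range is an assumption)."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUTS)]
                     for _ in range(N_HIDDEN)],
        "b_hidden": [0.0] * N_HIDDEN,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)],
        "b_out": 0.0,
    }


def forward(net, x):
    """Feed-forward pass: tanh hidden layer (assumed), linear output node."""
    hidden = [math.tanh(sum(w * v for w, v in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    return sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]


def count_parameters(net):
    """Number of trainable weights and biases in the network."""
    return (sum(len(ws) for ws in net["w_hidden"]) + len(net["b_hidden"])
            + len(net["w_out"]) + 1)


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and testing sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

With this layout the network has 19 adjustable parameters (12 input-to-hidden weights, 3 hidden biases, 3 hidden-to-output weights, and 1 output bias), and an 80/20 split of 233 samples gives 186 training and 47 testing samples. In practice the inputs and the target would be rescaled before training, a step not shown here.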

Multiple Linear Regression
Multiple linear regression (MLR) is considered a very useful and accurate tool that provides an equation linking a dependent variable to a number of independent variables that act as predictors [60].
Different authors have successfully applied this method in hydrogeology and hydrochemistry to predict water quality [60,61] or to establish a statistical model [62]. In the present paper, MLR was employed to provide the equation to predict electrical conductivity. The predictors were selected after implementing the Pearson correlation coefficient, since selecting the appropriate predictor variables is necessary to improve the prediction level and minimize the required dataset [63]. The correlation coefficient (Pearson) is a statistical tool that is widely used to measure and establish the interrelationship and coherence pattern between two variables [63,64]. The advantage of MLR compared to ANNs is that it can provide an equation.
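As a rough illustration of how such an equation is obtained by least squares, the sketch below fits an intercept and four coefficients by solving the normal equations in plain Python. The sample values and the coefficients used to generate the synthetic EC values are hypothetical and are not the fitted equation of this study; the function names are likewise assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_mlr(X, y):
    """Return [intercept, b1, ..., bk] minimizing the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coefs, x):
    """Evaluate the fitted linear equation for one sample."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Hypothetical (Ca, Mg, Na, Cl) concentrations, for illustration only
X = [
    [100, 30, 100, 100],
    [150, 30, 100, 100],
    [100, 50, 100, 100],
    [100, 30, 200, 100],
    [100, 30, 100, 250],
    [150, 50, 200, 250],
]
# Synthetic EC values generated from made-up coefficients (not the study's)
y = [100 + 2.0 * ca + 3.0 * mg + 2.5 * na + 1.5 * cl for ca, mg, na, cl in X]

coefs = fit_mlr(X, y)
```

Because the synthetic target is an exact linear function of the inputs, the fit recovers the generating coefficients up to rounding error; with real measurements the residuals would be nonzero and would be assessed with the performance indices.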

Performance Evaluation of the Models
The forecasting ability of the models was evaluated using the following statistical indexes:

The coefficient of determination (R²) gives the percentage of the variation of the variable on the y-axis that is explained by the variable on the x-axis. It ranges from 0 to 1.

The mean absolute percentage error (MAPE, %) is a measure of the prediction accuracy of a forecasting method, defined by Equation (1):

\[ \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \tag{1} \]

where $A_t$ is the actual value, $F_t$ is the forecast value, and $n$ is the number of samples.

The root mean square error (RMSE) is the square root of the mean of the squared errors (Equation (2)):

\[ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( S_i - O_i \right)^2 } \tag{2} \]

where $O_i$ are the observations, $S_i$ are the predicted values of a variable, and $n$ is the number of observations. RMSE is a good measure of accuracy, but only for comparing the prediction errors of different models or model configurations for a particular variable, and not between variables [65].

Modeling efficiency (EF) is used to compare predicted versus observed values (Equation (3)). A value equal to 1 indicates a perfect model performance. Generally, values between 0 and 1 indicate that the model's predictions are a better estimator than the mean of the observed dataset, whereas negative values indicate that they are worse [66]:

\[ \mathrm{EF} = 1 - \frac{ \sum_{i=1}^{n} \left( O_i - S_i \right)^2 }{ \sum_{i=1}^{n} \left( O_i - \bar{O} \right)^2 } \tag{3} \]

where $O_i$ are the observations, $\bar{O}$ is their mean, $S_i$ are the predicted values of a variable, and $n$ is the number of observations.

A high R² and EF and a low MAPE and RMSE indicate good model performance.
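The three error indices defined above translate directly into code. The following is a minimal sketch assuming the standard definitions of MAPE, RMSE, and modeling efficiency; the function names and the EC values are hypothetical and serve only to show the arithmetic.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error (%), as in Equation (1)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def rmse(observed, predicted):
    """Root mean square error, as in Equation (2)."""
    n = len(observed)
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, predicted)) / n)


def modeling_efficiency(observed, predicted):
    """Modeling efficiency (EF), as in Equation (3); 1 is a perfect fit."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical EC values, for illustration only
observed = [500.0, 1000.0, 1500.0, 2000.0]
predicted = [550.0, 950.0, 1500.0, 2100.0]

print(mape(observed, predicted))
print(rmse(observed, predicted))
print(modeling_efficiency(observed, predicted))
```

For these made-up values the MAPE is 5%, the RMSE is about 61.24 (in the units of EC), and the EF is 0.988. Predicting the observed mean for every sample would give an EF of 0, which is why positive values indicate a model that is more informative than the mean.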

Results and Discussion
The results from the descriptive analysis of the samples for each period are presented in Table 1. The mean nitrate value was 38.8, which is considered relatively high and cannot be attributed to natural causes. Given that this is an agricultural area, the high values are mainly related to the use of fertilizers and to the lack of a sewerage network for the settlements scattered in the study area during the sampling periods. The Gallikos River basin has been characterized as an area vulnerable to nitrate pollution from agriculture. Guidelines for fertilization practices that should be implemented for the protection of the water resources according to crop type, soil slope, and classification are described in the Official Government Gazette of the Hellenic Republic n.1496/v.2/3-05-2019 http://www.et.gr/idocs-nph/search/pdfViewerForm.html?args=5C7QrtC22wFqnM3eAbJzrXdtvSoClrL8JfWk9tSupxYfP1Rf9veiteJInJ48_97uHrMts-zFzeyCiBSQOpYnTy36MacmUFCx2ppFvBej56Mmc8Qdb8ZfRJqZnsIAdk8Lv_e6czmhEembNmZCMxLMtUhnwnTxyShEgwBm79OuvSkRyUUHxgps8WhFndSwtJl1 (Last accessed: 24 August 2021). The values of many samples exceeded the maximum permissible value for potable water, set by the World Health Organization and National Legislation, for the following parameters: EC (7 samples), Na (16 samples), Cl (21 samples), and NO₃ (57 samples). For Na and Cl, this can be attributed to the operation of fabric dyeing units during the sampling period, and for NO₃ to fertilizers, as mentioned above. Except for the pH, the values of the other parameters varied over a wide range, as indicated by the high standard deviations, due to the different conditions that prevail in the different parts of the basin.
The Pearson correlation matrix identified the influence of Ca, Mg, Na, and Cl on EC, finding a significantly positive correlation (Table 2). The factor analysis was valid for the four periods, since the Kaiser-Meyer-Olkin coefficient had a value of 0.681 (>0.5). At each period, three factors showed eigenvalues higher than 1, based on the selection criteria. These factors explain more than 68.2% of the total variance, which is statistically significant.
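The two screening steps described above, the Pearson correlation matrix and the retention of factors with eigenvalues higher than 1, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the eigenvalues are assumed to have been computed elsewhere from the correlation matrix, and the Kaiser-Meyer-Olkin test is not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of named data columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def retain_factors(eigenvalues):
    """Keep factors with eigenvalue > 1 and report the variance explained (%)."""
    retained = [e for e in eigenvalues if e > 1.0]
    explained = 100.0 * sum(retained) / sum(eigenvalues)
    return retained, explained
```

For example, `retain_factors([3.0, 1.5, 1.0, 0.5])` (made-up eigenvalues) keeps the first two factors, which together account for 75% of the total variance. Variables that correlate strongly with EC in the matrix and load on the same factor are natural candidates for model inputs.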
The results (Table 3) showed that Na, Cl, Ca, Mg, and EC participated in the first factor, revealing that the main processes defining groundwater quality are pollution from industrial activities in the area and carbonate rock dissolution. Nitrate pollution due to the agricultural activities did not have a strong impact on the EC value, since the nitrates and potassium from agricultural pollution participate in the second factor [67]. The participation of SO₄ in the third factor can also be attributed to pollution from agriculture due to fertilization [68,69].
Therefore, EC cannot be used for the detection of agricultural pollution.
A farmer's income depends on the crop yield which, in turn, relies on irrigation water availability and quality [70]. Application of saline irrigation water causes degradation of soil fertility, and crop problems can develop [71]. Factor analysis can be effectively employed to identify the main factors that affect irrigation water quality.
After the evaluation of the results using descriptive and multivariate statistics, the impact of Ca, Mg, Na, and Cl (independent variables) on electrical conductivity (dependent variable) was established.
The implementation of multiple linear regression using the outcomes of the correlation coefficient and factor analysis resulted in Equation (4). The MLR results revealed that the prediction of the dependent parameter (EC) using the parameters indicated by the Pearson coefficient was valid, since the coefficient of determination (R²) was high (0.94). As depicted in Figure 3, the measured versus predicted values using the MLR method were close to the 1:1 line. In Figure 4, the error values are plotted very close to the horizontal axis. In Table 4, the coefficient of determination, modeling efficiency, mean absolute percentage error, and root mean square error calculated from the results of the MLR and ANN models are presented. The values of these indices were considered statistically significant, verifying that forecasting of electrical conductivity using Ca, Mg, Na, and Cl values was valid for the examined data set for both methods. The high R² and EF values of both models, along with the small MAPE (%) and RMSE values, indicate that they provided a reliable prediction of the EC. In addition, the comparison of the index values highlights that the performances of the ANN and MLR were similar on this large dataset, which included 233 samples taken from sampling points scattered around a large area with different geological conditions and land uses. In order to verify the accuracy and reliability of the two methods, a small dataset of 18 samples collected during the wet period of 2006 was used.
The results of the MLR and ANN are depicted in Figures 7 and 8, respectively. The evaluation criteria of the models' performance, depicted in Table 5, verify that the predicted values of EC are valid, but for a small set of data, the performance of the MLR was better than that of the ANN. The dependent variable (EC) was explained better in the MLR model by the independent variables (i.e., Ca, Mg, Na, Cl), since the coefficient of determination was much higher and the mean absolute percentage error was significantly smaller. Forecasting models are very useful tools for water managers and can be used to predict the water quality with respect to changes in hydrological and hydrogeological regimes, showing better performance than traditional statistical methods [72,73]. With the use of these models, complex data resulting from various natural or human processes are easily transformed into practical and understandable information for scientists, stakeholders, and policy makers involved in water management or even for the general population [74].

Conclusions
The aquifers within the boundaries of the Gallikos River basin have developed in an area with intensive agricultural activities and small-scale enterprises, receiving different types of pollutants. Agriculture determines the economy of the area and, hence, farmers' income, since it constitutes the most important employer. Crop yield and soil quality depend on irrigation water quantity and quality. Therefore, special management practices may be required.
Irrigation water salinity, which is a measure of quality, can be described through electrical conductivity. Artificial neural networks and multiple linear regression are commonly used with great success in the prediction of water parameters because of their good performance, simplicity, and low data requirements. This was the motivating factor for their application to the present study.
Samples collected during a three-year experimental period (2004–2006) were used for the calibration, validation, and evaluation of the models. The multiple linear regression and artificial neural network models had similar performances in the case of a large dataset (233 samples). Both models provided reliable results, since all the evaluation indices that were used were statistically valid (R² > 0.927, EF > 0.93, MAPE (%) < 14.5). In the case of implementing the two models on a smaller verification dataset (18 samples), the forecasting ability remained statistically significant (R² > 0.75, EF > 0.976, MAPE (%) < 20) for both, but the MLR method achieved a better performance. Factor analysis is a suitable method for the selection of the input parameters for the MLR and ANN models, based on the evaluation of their accuracy and reliability.
The outcomes of this research in the specific case study area have practical importance, since the in situ measurement of EC is time-consuming and costly. According to this study, these measurements could be avoided. The methodology followed in this study could be used as an effective tool for quality parameter forecasting in any other region that faces environmental problems. This study provides the necessary steps and techniques for parameter selection and model performance evaluation.

Conflicts of Interest:
The authors declare no conflict of interest.