Susceptibility Prediction of Groundwater Hardness Using Ensemble Machine Learning Models

: Groundwater resources, unlike surface water, are more vulnerable to disturbances and contaminations, as they take a very long time and signiﬁcant cost to recover. So, predictive modeling and prevention strategies can empower policymakers for e ﬃ cient groundwater governance through informed decisions and recommendations. Due to the importance of groundwater quality modeling, the hardness susceptibility mapping using machine learning (ML) models has not been explored. For the ﬁrst time, the current research aimed to predict groundwater hardness susceptibility using the ML models. The performance of two ensemble models of boosted regression trees (BRT) and random forest (RF) is investigated through the arrangement of a comparative study with multivariate discriminant analysis (MDA). According to the hardness values in 135 groundwater quality monitoring wells, the hard and soft water are determined; then, 11 predictor variables including distance from the sea (DFS), land use, elevation, distance from the river (DFR), depth to groundwater (DTGW), pH, precipitation (PCP), evaporation (E), groundwater level (GWL), curvature, and lithology are used for predicting the groundwater hardness susceptibility map. Results indicated that the variables of DFR, DTGW, elevation, and DFS had a higher contribution to the modeling process. So, the high harness areas are mostly related to low elevations, low DTGW, and proximity to river and sea, which facilitate the percolation conditions for minerals containing calcium or magnesium into groundwater.


Introduction
Groundwater is the fundamental source of drinking water and agriculture irrigation for at least half of the world's population [1]. From this perspective, the contribution of groundwater for implementing the United Nations' Sustainable Development Goals (SDGs) is reported to be

Study Area
The study area is situated in the east of Mazandaran province, Iran, with an area of about 3297 square kilometers which expands from longitudes 52 • 36 to 53 • 23 E and latitudes from 35 • 44 to 36 • 46 N (Figure 1). The Ghaemshahr-Joibar plain is surrounded from the north with the Caspian Sea, from the east with the Siahrood River, from the west with the Thalar River, and from the south with the Alborz mountains. This plain is located in the Talar and Siahrood River Basins which reaches in the north to the Caspian Sea. Because of the Caspian Sea adjacent to this plain, the climate is categorized as Caspian mild with a minimum temperature of 6 • C in winter and the average maximum temperature of 25 • C in summer times [51]. The rainfall is about 700 mm in the southern and 850 mm in the northern part of the plain, in which the wet period of the year occurs in November and December. There are agricultural and cultivation activities in this plain with the predominant cultivation of rice; vegetables, colza, broad bean, and clover are the other crops that are commonly cultivated in winter months of the year [52]. Groundwater is among the main sources of water supply; more than half of drinking water demand is provided by groundwater resources such as springs and wells. The use of groundwater in the agricultural sector is the highest with 87.8%, then it is followed by drinking needs (11.4%), and the lowest percentage is for the industrial sector with 0.9% [53].
Water 2020, 12, x FOR PEER REVIEW 3 of 17

Study Area
The study area is situated in the east of Mazandaran province, Iran, with an area of about 3297 square kilometers which expands from longitudes 5236 to 5323 E and latitudes from 3544 to 3646 N ( Figure 1). The Ghaemshahr-Joibar plain is surrounded from the north with the Caspian Sea, from the east with the Siahrood River, from the west with the Thalar River, and from the south with the Alborz mountains. This plain is located in the Talar and Siahrood River Basins which reaches in the north to the Caspian Sea. Because of the Caspian Sea adjacent to this plain, the climate is categorized as Caspian mild with a minimum temperature of 6 °C in winter and the average maximum temperature of 25 °C in summer times [51]. The rainfall is about 700 mm in the southern and 850 mm in the northern part of the plain, in which the wet period of the year occurs in November and December. There are agricultural and cultivation activities in this plain with the predominant cultivation of rice; vegetables, colza, broad bean, and clover are the other crops that are commonly cultivated in winter months of the year [52]. Groundwater is among the main sources of water supply; more than half of drinking water demand is provided by groundwater resources such as springs and wells. The use of groundwater in the agricultural sector is the highest with 87.8%, then it is followed by drinking needs (11.4%), and the lowest percentage is for the industrial sector with 0.9% [53].

Dataset
Datasets in this study are divided into dependent (output) and independent (inputs) data, respectively, including groundwater hardness values and geo-environmental factors:

Dataset
Datasets in this study are divided into dependent (output) and independent (inputs) data, respectively, including groundwater hardness values and geo-environmental factors:

Hardness Data
Hardness data from 135 groundwater monitoring wells ( Figure 1) are received from the Iranian Water Resources Management Company (IWRMC) from 2001 to 2016. The average value of hardness data was calculated for each well, and the hard/soft water was determined using the World Health Organization (WHO) guideline considering a threshold equal to 500 mg/L [54].

Environmental Factors
There are some conditions and factors, especially environmental factors, which affect groundwater resources, and these factors have an essential role in sustainable groundwater management. Table 1 indicates the list of the considered factors in this study, in which each factor is described as follows: Elevation: Elevation is a key environmental factor that influences the water surface and the groundwater flows [55]. In other words, there is an inverse relation between elevation and infiltration and recharge, so that the values of both of them will be lower at the higher altitudes [56]. Moreover, elevation changes with different climatic features will also affect soil and cover plant conditions [57,58]. Curvature: Curvature as one of the morphometric characteristics has a considerable effect on the divergence and convergence flowing pattern; so, what is remarkable in its role on the recharge potential of groundwater [59,60]? The DEM was used to derive the curvature map in ArcGIS software. The range of curvature in the study area was between −18 and 25.74 ( Figure 2b).
Distance from sea (DFS): The Ghaemshahr-Joibar plain is near to the Caspian Sea; therefore, calculating DFS is an important parameter to detect the quality and hardness of groundwater. The DFS affects the hydraulic status among seawater and aquifer as well as makes changes in different elements of water balance. The map of the DFS was prepared by Euclidean Distance in ArcGIS (Figure 2c).
Distance from river (DFR): Siyahrood and Talar are the major rivers in the study area; hence, infiltration and charge from them are important. The map of DFR was calculated by the Euclidean Distance function in ArcGIS ( Figure 2d).
Precipitation (PCP): By the infiltration of rain into the groundwater, the pressure of carbon dioxide due to oxidation of volatile organic matter was increased [61]. Besides increasing the pressure of CO2, the pressure of carbonic acid (HCO 3 − ) and dissolution of the carbonate minerals were raised.
Thus, the result of this process increased the Groundwater Hardness. Annual precipitation data of meteorological stations from 2001 to 2016 are obtained from the Iranian Meteorological Organization (IRIMO, http://www.irimo.ir/). The range of long-term mean precipitation value in the study area was recorded from 770 to 824 mm ( Figure 2e). Evaporation (E): Evaporation is a key reason for the discharge of groundwater in the arid region with low elevation and shallow depth of groundwater [62]. So, the evaporation process by influencing the groundwater level causes the initial quality water to change. The data of yearly evaporation (from 2001 to 2016) were obtained from the IRIMO. The range of long-term mean evaporation was from 360 mm to 1139 mm ( Figure 2j).
Depth to groundwater (DTGW): DTQW (or unsaturated soil zone) has a significant effect on the hydrological process as infiltration capacity and runoff [63]. The DTGW data were collected from monitoring wells in the study area which were obtained from the Iranian Water Resource Management Company (IWRMC). In this study, the DTGW map was produced in ArcGIS, and according to Figure 2g, the highest value of that was 34.5m.
Groundwater level (GWL): GWL affects the quality of water, and by decreasing it, the quality also will be reduced [64]. Changes in GWL are directly affected by human activities and climate factors such as evaporation and precipitation. GWL data were obtained from all of the wells observed by the IWRMC. The range of this map was from −44.2 to 63.2 m (Figure 2h).
pH: Due to the influence of lithological and distance from sea parameters, pH causes a change in the quality and hardness of groundwater.
Landuse: Landuse is among the factors that have an impact on the surface, subsurface and groundwater flow, and evapotranspiration [65][66][67]. Overuse of land due to the increasing agricultural and urban land use with population growth will cause changes in groundwater quality and hardness. The land use map of the Ghaemshahr-Joibar plain includes six classes of agriculture, dry farming, forest, orchard, rangeland, and urban ( Figure 2f).
Lithology: Due to different potential of permeability and stability of rocks and soil, hydrological parameters are influenced by lithological and geomorphological structures [42,68]. Moreover, the morphology of lithological composition and the dissolution of their minerals play an important role in increasing groundwater elements [69]. The weathering process of sedimentary rocks and clay minerals is an effective factor in groundwater quality and hardness. A 1:100,000-scale geological map was used to prepare the lithology map (Figure 2k). Water 2020, 12, x FOR PEER REVIEW 6 of 17

Modeling Procedure
After preparing the input data, the multicollinearity analysis was performed to remove probably collinear variables from the modeling process. The Variance Inflation Factor (VIF) was used to investigate the rate of multicollinearity between ith independent variables and other independent variables. When the VIF is greater than 10, try to remove one or two factors often or combine some independent factors [70,71]. After ensuring the lack of multicollinearity between variables (see the results in Table 2), the k-fold cross-validation method was used to calibrate the models. In this method, basic data samples were divided into K same mutual subsets based on hierarchical sampling. K-1 subsamples were applied as training the model, and the remaining subsample was used for evaluation of the results [72]. Then, the validation process was repeated K times (equal to the number of subsets). Despite any intense rule for the value of K in this method, it is often reported as 10 [73].

Model Description
Machine learning models were applied to model hardness susceptibility of groundwater, which are described as follows: Boosted Regression Trees (BRT): The BRT model was firstly introduced by Freund and Schapire [74] as a powerful algorithm for continuous and categorical variables. It is an ensemble technique that works based on both strengths of regression trees (models which use the recessive binary splits to answer their predictors) and boosting algorithms (a combination of different models to modify the forecast of performance). Some parameters which have the main roles on the BRT fitting include the rate of learning, rate of bagging, complexity of tree, minimum number of observations at the end nodes, and number of trees. In comparison to other predictive methods the BRT has some advantages such as (i) managing various kinds of predictor variables, (ii) improving the missing or lost data, (iii) no need to convert or delete outlier data, and (iv) fitting and controlling the complex nonlinear interaction between variables [75]. See Freund and Schapire [74], Schapire [76], and Elith et al. [75] for more details of the BRT model.
Random Forest (RF): Breiman [77] presented a developing method of new decision trees which combined several signal algorithms using the rules. RF as a non-parametric method consists of clusters of classification and regression trees. The description of the RF is based on a set of tree-structured classifiers is given as: where Θ k are independent identically distributed random vectors, and each tree casts a unit vote for the most popular class at input x. The number of trees and the number of predictor variables are the main parameters in the RF according to which the decision tree grows to the largest possible size without being pruned [78]. It should be noted that the general principles of group training are based on the assumption that the accuracy of them is higher than other training algorithms. To make a growth tree, the RF applies the best variables or dividing points in the variable subsets which are randomly selected, so it reduces the overall error of the model [77]. More details of the RF model are described in Breiman [77].
Multivariate discriminant analysis (MDA): The MDA is introduced by Hair et al. [79]. One of the purposes of MDA is related to forecasting the group membership based on the relation of a dependent variable and its observed features of predictor collection [80,81]. The main base of the MDA is a discriminant function that makes a linear compound for independent variables. It is necessary to note that the important assumptions of the MDA are: (i) the independent parameters are in careful multivariate normal distribution; (ii) the predictors should not be strongly correlated, and their average and variance have not been accounted for; also, (iii) two predictors should have a stable correlation between groups [81]. The equation of linear discriminant is presented as follows [70]: in which Y is a discriminant score, Wi (i = 1,2,3, . . . ,n) are discriminant weights, and Xi (i = 1,2,3, . . . ,n) are independent variables. More details of the MDA model are presented in Hair et al. [79].

Performance Evaluation
To evaluate the performance of predictive models, some metrics such as the area under curve (AUC) of the receiver operating characteristics (ROC), Accuracy, and True Skill Statistic (TSS) were used. The receiver operating characteristic was used to detect the quality rate of a produced map in addition to the quality value of the forecasting model [60,67]. According to Negnevitsky [82], this metric is defined as the potential of a predictive system to accurately forecast whether a specific event had happened. Basic components of the ROC curve were used to plot the test values as false positive rate (FPR) against the training values as true positive rate (TPR). The range of AUC changed from 0 to 1, and values more than 0.8 indicate very good performance. The Accuracy shows what fraction of the predicted data is correct, and it varies between 0 and 1 [83]. The TSS was calculated from the difference between the TPR and FPR values. The value of TSS varied between 1 and −1; + 1 shows the perfect agreement, and zero and smaller amounts reveal that the performance is not normal [84]. The AUC, Accuracy, and TSS can be achieved as follows [84][85][86]: In which TP (true positive) and TN (true negative) are defined as the number of occurrences and non-occurrences, respectively, that are acceptably classified, while FP (false positive) and FN (false negative) are described as the amount of data which are incorrectly classified.

Modeling Results
After ensuring the lack of the multicollinearity problems (Table 2) among the variables, the modeling process was conducted using the k-fold cross-validation methodology. The performance of the predictive models is presented in Table 3. As can be seen, the area under curve (AUC) values for all the models was more than 80% (Table 3, Figure 3). Accuracy values for the MDA and RF models were about 83% and were higher than the BRT model (Accuracy = 78%). Additionally, True Skill Statistic (TSS) values were equal to 0.73, 0.71, and 0.59, respectively, for RF, BRT, and MDA models (Table 3).  (Table 3).

Spatial Prediction of Groundwater Hardness Susceptibility
Using all the pixel values for the whole region, the groundwater hardness susceptibility maps were predicted and classified into three classes of low, moderate, and high based on the natural breaks' classification methodology. Figure 4 indicates the predicted maps in which low classes had a higher area, respectively-2474.14, 1924.30, and 2403.51 km 2 for BRT, RF, and MDA models. Additionally, moderate class locations were covered by about 443.34, 1022.22, and 518.07 km 2 for the BRT, RF, and MDA models, respectively. The high hardness susceptibility class was equal to 379.21, 350.17, and 375.11 km 2 , respectively, for BRT, RF, and MDA models, which are mostly located in the south of the Ghaemshahr-Joibar plain (Figure 4, Table 4).

Spatial Prediction of Groundwater Hardness Susceptibility
Using all the pixel values for the whole region, the groundwater hardness susceptibility maps were predicted and classified into three classes of low, moderate, and high based on the natural breaks' classification methodology. Figure 4 indicates the predicted maps in which low classes had a higher area, respectively-2474.14, 1924.30, and 2403.51 km 2 for BRT, RF, and MDA models. Additionally, moderate class locations were covered by about 443.34, 1022.22, and 518.07 km 2 for the BRT, RF, and MDA models, respectively. The high hardness susceptibility class was equal to 379.21, 350.17, and 375.11 km 2 , respectively, for BRT, RF, and MDA models, which are mostly located in the south of the Ghaemshahr-Joibar plain (Figure 4, Table 4).

Variable Importance Analysis
The importance of the variables was calculated using the decrease in AUC (DAUC) after removing each variable from the modeling process ( Figure 5). Higher DAUC for a variable indicates the higher importance of that variable. As can be seen, variables of DFR, elevation, DTGW, and land use were identified as the most important inputs by the BRT model (respectively, with DAUC about 36%, 10%, 5%, and 4 %). Given the RF model, the variables of DFR, DTGW, elevation, and DFS (with the DAUC values of 62%, 34%, 12%, and 8%) had a higher contribution in the modeling process. Additionally, like the RF model, results of the MDA model demonstrated that the variables of DFR, DTGW, elevation, and DFS were the most important variables which had the DAUC values of 71%, 38%, 14%, and 9% ( Figure 5).
Water 2020, 12, x FOR PEER REVIEW 11 of 17

Variable Importance Analysis
The importance of the variables was calculated using the decrease in AUC (DAUC) after removing each variable from the modeling process ( Figure 5). Higher DAUC for a variable indicates the higher importance of that variable. As can be seen, variables of DFR, elevation, DTGW, and land use were identified as the most important inputs by the BRT model (respectively, with DAUC about 36%, 10%, 5%, and 4 %). Given the RF model, the variables of DFR, DTGW, elevation, and DFS (with the DAUC values of 62%, 34%, 12%, and 8%) had a higher contribution in the modeling process. Additionally, like the RF model, results of the MDA model demonstrated that the variables of DFR, DTGW, elevation, and DFS were the most important variables which had the DAUC values of 71%, 38%, 14%, and 9% ( Figure 5).

Discussion
The current research considered two ensemble machine learning models namely boosted regression trees (BRT) and random forest (RF) for susceptibility mapping of groundwater hardness, for the first time. Results of the ensemble models were compared with the model of multivariate discriminant analysis (MDA). According to the modeling results, the RF, BRT, and MDA models, respectively, indicated better performance. The RF model may use some data more than once in a

Discussion
The current research considered two ensemble machine learning models namely boosted regression trees (BRT) and random forest (RF) for susceptibility mapping of groundwater hardness, for the first time. Results of the ensemble models were compared with the model of multivariate discriminant analysis (MDA). According to the modeling results, the RF, BRT, and MDA models, respectively, indicated better performance. The RF model may use some data more than once in a training dataset, whereas some inefficient data will never be used. Therefore, it has more stability and a higher performance to predict in comparison to other methods [77].
To the authors' knowledge, although there is no previous study which used these models in this field, it should be noted that the good performance of the RF model has been demonstrated in other environmental fields such as groundwater nitrate prediction [39], earth fissure hazard prediction [87], and flash-flood hazard assessment [73]. Like our results, a study by Choubin et al. [88] demonstrated that the RF model had a superior performance than the MDA model for the prediction of air particulate matter (PM) hazard.
However, the Ensemble models such as BRT and RF have some advantages rather than other individual models. They can manage different types of predictor variables (e.g., both continuous and classification variables), improve the missing or lost data, have no need to convert or delete outlier data, have a lack of pre-analysis to select variables among a large number of predictors, increase the diversity of classification trees through the random selection of predictive variables over the different trees, and fit and control the complex nonlinear interaction between variables, and fitting the multiple trees in these models overcomes the biggest drawback (i.e., poor predictive performance) of single models [75,[89][90][91].
Spatial prediction of groundwater hardness susceptibility indicated that the high harness areas mostly have low elevations, low depth to groundwater (DTGW), are located near to Caspian Sea, have the Qm (swamp and marsh) lithology unit, and correspond with the agricultural areas. However, the percolation conditions for minerals containing calcium or magnesium into groundwater in these areas are higher and easier due to the low unsaturated zone (i.e., low DTGW), proximity to seawater, low elevations with more drainage and more water accumulation, and irrigated water.
Although the considered period in this study was relatively short (17 years), the groundwater quality and hardness may be affected by climate change and land use change for a longer period. Therefore, in future studies, it is recommended to consider these issues for providing susceptibility maps.

Conclusions
The main objective of this study was to model the hardness of groundwater using machine learning models and geo-environmental factors. The predicted hardness susceptibility maps indicated that the areas located in the south of the region have high hardness susceptibility. Some of the possible reasons might include the following: (i) the low depth to groundwater, (ii) proximity to seawater, (iii) low elevations with more drainage and more water accumulation, and (iv) irrigated water, all of which facilitate the percolation conditions for minerals containing calcium or magnesium into groundwater. However, the actual relationships between the geo-environmental factors and groundwater hardness need further in-depth research. Although results from this study were promising, the use of more related and available parameters as input such as soil information may affect and improve the results of the hardness modeling in other regions. Results from this study can help policymakers to understand the groundwater quality, concerning the water hardness and managing the freshwater for sustainable development. Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.