Soft Computing Ensemble Models Based on Logistic Regression for Groundwater Potential Mapping

Abstract: Groundwater potential maps are one of the most important tools for the management of groundwater storage resources. In this study, we proposed four ensemble soft computing models based on logistic regression (LR) combined with the dagging (DLR), bagging (BLR), random subspace (RSSLR), and cascade generalization (CGLR) ensemble techniques for groundwater potential mapping in Dak Lak Province, Vietnam. A suite of well yield data and twelve geo-environmental factors (aspect, elevation, slope, curvature, Sediment Transport Index, Topographic Wetness Index, flow direction, rainfall, river density, soil, land use, and geology) were used for generating the training and validation datasets required for the building and validation of the models. Based on the area under the receiver operating characteristic curve (AUC) and several other validation methods (negative predictive value, positive predictive value, root mean square error, accuracy, sensitivity, specificity, and Kappa), it was revealed that all four ensemble learning techniques were successful in enhancing the validation performance of the base LR model. The ensemble DLR model (AUC = 0.77) was the most successful model in identifying the groundwater potential zones in the study area, followed by the RSSLR (AUC = 0.744), BLR (AUC = 0.735), CGLR (AUC = 0.715), and single LR model (AUC = 0.71), respectively. The models developed in


Introduction
Groundwater is one of the most valuable sources of water supply for agricultural, urban, and industrial activities in many parts of the world [1]. Groundwater contributes to the economic development and biodiversity of an area [2]. The effective management of groundwater is critical for balancing different demands on water. Groundwater management is becoming progressively challenging due to rapid population growth and occurrences of intermittent and prolonged drought periods [3][4][5][6] that have placed increasing pressure on groundwater resources worldwide [7]. Therefore, further research on the exploration and estimation of groundwater potential and resources is necessary for proper water management of an area. Among different management strategies, groundwater potential mapping is an effective approach that can assist managers in adopting more efficient management plans [8][9][10][11][12].
Various methods have been suggested and used for producing groundwater potential maps. They range from the traditional surface and sub-surface potential assessment methods to the recent predictive models derived by machine learning methods. The surface and sub-surface methods include esoteric, geomorphologic, geological, soil and micro-biological, surface geophysical, test drilling of boreholes, and geophysical logging techniques [13]. Although these methods are highly efficient for obtaining an accurate estimate of groundwater potential, the lack of advanced technologies and sufficient hydrogeological data precludes their use in many countries.
A recent advancement in groundwater potential mapping is the combined use of different machine learning methods towards developing hybrid ensemble machine learning models for obtaining the most accurate results. Chen et al. [23] proposed the integrated application of J48 decision trees with the random subspace (RSS), rotation forest (RF), AdaBoost, bagging, and dagging to identify the groundwater potential zones in Wuqi County, China. Naghibi et al. [14] improved performances of the BRT, CART, and RF classifiers using a rotation forest ensemble technique for modeling groundwater potential in Meshgin Shahr, Iran. Miraki et al. [24] developed an ensemble model based on the RF classifier and the RSS ensemble technique to map the groundwater potential in Kurdistan, Iran. Avand et al. [25] integrated the best first decision tree with the AdaBoost, multiboosting, and bagging ensembles for groundwater potential mapping in the Yasuj-Dena area, Iran. Al-Fugara et al. [26] developed a hybrid model based on the SVM and genetic algorithm (GA) for groundwater potential mapping in Jerash and Ajloun, Jordan. In another study, Naghibi et al. [19] used GA for optimizing the structure of the SVM and RF models for groundwater potential mapping in Iran. Khosravi et al. [27] used several metaheuristic algorithms for optimizing a neuro-fuzzy model for groundwater potential mapping in Lorestan, Iran. Banadkooki et al. [28] used the whale optimization algorithm for optimizing the ANN for groundwater potential mapping. All these studies demonstrated the enhanced predictive performance of the hybrid ensemble models compared to the single models. In fact, the premise of the application of a hybrid ensemble is that groundwater potential mapping requires big data of various geo-environmental variables [11,12,23,24] that largely make the single modeling approaches inefficient in many regions [29].
In this study, we developed four ensemble models for groundwater potential mapping. Each model consisted of a logistic regression combined with an ensemble learning technique, namely dagging, bagging, cascade generalization (CG), and RSS. The development and application of these models were underpinned by real-world data from Dak Lak Province, Vietnam, where groundwater is the main source of water supply for domestic use and agriculture. The reliability and accuracy of these models were evaluated by comparing their areas under the receiver operating characteristic (ROC) curve (AUC) and the index values derived from several other validation methods.

Logistic Regression (LR)
LR is one of the most widely used empirical models in different fields of science, in particular for environmental studies [30][31][32][33]. In LR, the probability of occurrence of a phenomenon is estimated within the range of 0 to 1, and it is not necessary to assume the normality of the predictor variables. Binomial LR analysis is used when the dependent variable is binary (nominal with two levels), and it predicts the presence or absence of an attribute based on a set of independent variables. In simple linear regression, one variable is used to predict another variable (such as predicting temperature from altitude), while in multiple regression, the relationship between several independent variables and one dependent variable is measured. LR is a special type of multiple regression in which the dependent variable is discrete. In fact, the LR model describes the relationship between a dichotomous response variable (the presence or absence of a phenomenon) and a set of explanatory variables. The explanatory variables may be continuous or discrete and need not follow a normal distribution.
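As a minimal sketch of how LR maps a set of predictors onto a probability between 0 and 1, the following uses the standard logistic function. The coefficient values and the two predictors are hypothetical and are not taken from the fitted models of this study.

```python
import math

def logistic_probability(features, coefficients, intercept):
    """Return P(y = 1 | x) for a binary logistic regression model."""
    # Linear predictor: z = b0 + b1*x1 + ... + bn*xn
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The logistic function maps z onto the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two illustrative predictors
p = logistic_probability([0.5, 2.0], coefficients=[1.2, -0.4], intercept=0.1)
print(round(p, 3))
```

A location would then be assigned to the "potential" class when the estimated probability exceeds a chosen cut-off, commonly 0.5.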

Bagging
The name bagging is derived from bootstrap aggregating, which is one of the first, most intuitive, and simple ensemble-based algorithms with excellent performance [34][35][36][37]. The diversity of classifiers is obtained through bootstrap replicates of the training dataset. Different subsets are drawn (with replacement) from the whole training set. Each subset is then used to train a different classifier. The individual classifiers are thereafter combined by a majority vote on their decisions. The class selected by the largest number of classifiers is the final ensemble decision. Since training datasets may overlap, additional measures may be used to increase diversity, such as using a subset of training data to train each classifier, or using relatively weak classifiers such as decision stumps. Bagging is effective mainly for unstable nonlinear models, in which small changes in the training data cause large changes in the resulting classification and accuracy. Reducing variance in the results of unstable learners reduces the final error [38].
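The bootstrap-and-vote procedure described above can be sketched as follows. This is an illustration only: a nearest-centroid rule stands in for the LR base learner to keep the example short, and the function names, toy data, and number of models are assumptions rather than the settings used in this study.

```python
import random
from collections import Counter

def fit_nearest_centroid(rows, labels):
    """Stand-in base learner: predict the class whose feature mean is closest."""
    centroids = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
    return predict

def bagging_fit(rows, labels, fit, n_models=11, seed=0):
    """Train one classifier per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([rows[i] for i in idx], [labels[i] for i in idx]))
    return models

def majority_vote(models, row):
    """Return the class chosen by the largest number of classifiers."""
    return Counter(model(row) for model in models).most_common(1)[0][0]

# Toy one-feature dataset (hypothetical values)
rows = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(rows, labels, fit_nearest_centroid)
print(majority_vote(models, [0.05]), majority_vote(models, [1.15]))
```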

Dagging
Ting and Witten proposed dagging in 1997. Dagging uses disjoint samples in place of bootstrap samples to derive the base classifiers [39]. The predictions of the base classifiers are combined by majority voting.
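The difference from bagging, namely disjoint folds instead of bootstrap samples, can be sketched as below. A one-nearest-neighbour rule stands in for the LR base learner, and the fold count, names, and toy data are illustrative assumptions.

```python
import random
from collections import Counter

def disjoint_folds(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into non-overlapping folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_folds] for k in range(n_folds)]

def fit_nearest_neighbour(rows, labels):
    """Stand-in base learner: return the label of the closest training row."""
    def predict(row):
        nearest = min(range(len(rows)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(rows[i], row)))
        return labels[nearest]
    return predict

def dagging_fit(rows, labels, fit, n_folds=3, seed=0):
    """Train one base classifier per disjoint fold (no resampling)."""
    return [fit([rows[i] for i in fold], [labels[i] for i in fold])
            for fold in disjoint_folds(len(rows), n_folds, seed)]

def majority_vote(models, row):
    return Counter(model(row) for model in models).most_common(1)[0][0]

# Toy one-feature dataset (hypothetical values), six samples per class
rows = [[v / 10] for v in range(6)] + [[1 + v / 10] for v in range(6)]
labels = [0] * 6 + [1] * 6
models = dagging_fit(rows, labels, fit_nearest_neighbour)
print(majority_vote(models, [0.2]), majority_vote(models, [1.2]))
```

Because every training sample lands in exactly one fold, each base classifier sees a smaller but fully independent part of the data.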

Cascade Generalization (CG)
One of the effective algorithms for classifier fusion is the CG algorithm, which works at the meta level [40]. In this algorithm, the predictions of base-level classifiers are used to increase the dimension of the input space. To do this, the output of each base-level classifier is added as a new attribute to each of the training examples. Thus, both base-level and meta-level classifiers use the original input features for training, while meta-level classifiers can also use the additional features (the predictions of the base-level classifiers).
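A minimal sketch of this attribute-extension step is given below. The two stand-in learners (a class-mean rule at the base level and a nearest-neighbour rule at the meta level) and the toy data are assumptions chosen for brevity. They are not the LR configuration of this study, and class labels are assumed to be coded as 0 and 1.

```python
def fit_class_mean(rows, labels):
    """Stand-in base-level learner using only the first feature."""
    means = {}
    for c in set(labels):
        values = [r[0] for r, y in zip(rows, labels) if y == c]
        means[c] = sum(values) / len(values)
    return lambda row: min(means, key=lambda c: abs(row[0] - means[c]))

def fit_nearest_neighbour(rows, labels):
    """Stand-in meta-level learner: label of the closest training row."""
    def predict(row):
        nearest = min(range(len(rows)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(rows[i], row)))
        return labels[nearest]
    return predict

def cascade_fit(rows, labels, fit_base, fit_meta):
    """Append the base-level prediction to each example, then train the meta level."""
    base = fit_base(rows, labels)
    extended = [row + [base(row)] for row in rows]  # original features + new attribute
    meta = fit_meta(extended, labels)
    return lambda row: meta(row + [base(row)])

# Toy two-feature dataset (hypothetical values)
rows = [[0.0, 5.0], [0.2, 4.0], [1.0, 5.0], [1.2, 4.0]]
labels = [0, 0, 1, 1]
model = cascade_fit(rows, labels, fit_class_mean, fit_nearest_neighbour)
print(model([0.1, 4.5]), model([1.1, 4.5]))
```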

Random Subspace (RSS)
RSS was proposed in 1998 to improve the accuracy of weak classifiers and individual classification performance. RSS is one of the most popular random sampling methods [41,42], in which the original feature set is sampled randomly. RSS creates multiple low-dimensional subspaces, trains a classifier on each, and then combines the resulting classifiers by majority vote [42,43]. RSS has been used in several fields such as economics [44] and medicine [45], but rarely for determining groundwater potential.
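The feature-sampling idea can be sketched as follows: each ensemble member is trained on a random subset of the feature columns, and the members vote. The nearest-neighbour stand-in learner, subspace size, and toy data are illustrative assumptions.

```python
import random
from collections import Counter

def fit_nearest_neighbour(rows, labels):
    """Stand-in base learner: return the label of the closest training row."""
    def predict(row):
        nearest = min(range(len(rows)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(rows[i], row)))
        return labels[nearest]
    return predict

def random_subspace_fit(rows, labels, fit, n_models=5, subspace_size=2, seed=0):
    """Train each member on a randomly chosen subset of the feature columns."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    ensemble = []
    for _ in range(n_models):
        cols = rng.sample(range(n_features), subspace_size)
        projected = [[row[j] for j in cols] for row in rows]
        ensemble.append((cols, fit(projected, labels)))
    return ensemble

def random_subspace_predict(ensemble, row):
    """Project the sample into each member's subspace and take a majority vote."""
    votes = Counter(model([row[j] for j in cols]) for cols, model in ensemble)
    return votes.most_common(1)[0][0]

# Toy three-feature dataset (hypothetical values)
rows = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.1], [1.0, 1.0, 1.0], [0.9, 1.1, 1.2]]
labels = [0, 0, 1, 1]
ensemble = random_subspace_fit(rows, labels, fit_nearest_neighbour)
print(random_subspace_predict(ensemble, [0.05, 0.1, 0.0]))
```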

Statistical Indices
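Numerous statistical indices have been used in different studies to evaluate the performance of predictive models [46,73,74]. In this study, we used true positive (TP), true negative (TN), false positive (FP), false negative (FN), positive predictive value (PPV), negative predictive value (NPV), sensitivity (SST), specificity (SPF), accuracy (ACC), root mean square error (RMSE), and Kappa for the comparison and validation of the models. The equation for each one of these indices is as follows [75][76][77][78][79][80][81][82]:

$$
\begin{aligned}
\mathrm{PPV} &= \frac{TP}{TP + FP}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN}, \\
\mathrm{SST} &= \frac{TP}{TP + FN}, \qquad \mathrm{SPF} = \frac{TN}{TN + FP}, \\
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(V_{\mathrm{predicted},i} - V_{\mathrm{actual},i}\right)^{2}}, \\
\mathrm{Kappa} &= \frac{P_{a} - P_{est}}{1 - P_{est}},
\end{aligned}
$$

where n is the total number of samples, V_predicted and V_actual are the predicted and actual values of the i-th sample, P_a is the relative observed agreement among raters, and P_est is the hypothetical probability of chance agreement.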
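As an illustration, the indices listed above can be computed from the four confusion-matrix counts using their standard definitions. The counts in the example are hypothetical and are not results of this study. For Kappa, the observed agreement is taken as the accuracy, and the chance agreement is computed from the marginal totals.

```python
import math

def classification_indices(tp, tn, fp, fn):
    """Standard confusion-matrix indices for a binary classifier."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    # Chance agreement from the row and column totals of the confusion matrix
    p_est = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "ACC": acc,
        "Kappa": (acc - p_est) / (1 - p_est),
    }

def rmse(predicted, actual):
    """Root mean square error between predicted and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical counts for illustration only
indices = classification_indices(tp=40, tn=30, fp=10, fn=20)
print(indices)
```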

OneR Feature Selection Method
To determine the groundwater potential using machine learning, it is necessary to select the appropriate predictive variables as the model inputs. In this study, we used the OneR feature selection technique for ranking the predictive variables and selecting those factors that contribute the most to groundwater potential. This process can efficiently decrease noise and the over-fitting problem of the modeling process [83,84], leading to an increased quality of the results [46].
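The idea behind OneR ranking can be sketched as follows: for each factor, a single rule maps every factor value to its majority class, and the accuracy of that rule serves as the merit of the factor. This simplified version scores each rule on the training data and assumes the factors are already discretized. The feature names and values are hypothetical, and a cross-validated average merit, as reported in this study, would repeat the calculation over several folds.

```python
from collections import Counter, defaultdict

def one_r_merit(values, labels):
    """Accuracy (%) of the single-feature rule 'value -> majority class'."""
    by_value = defaultdict(Counter)
    for v, y in zip(values, labels):
        by_value[v][y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return 100.0 * correct / len(labels)

def rank_features(table, labels):
    """table maps a feature name to its list of discretized values."""
    merits = {name: one_r_merit(column, labels) for name, column in table.items()}
    return sorted(merits.items(), key=lambda item: item[1], reverse=True)

# Hypothetical discretized factors for four wells
table = {
    "rainfall_class": ["low", "low", "high", "high"],
    "aspect_class": ["N", "S", "N", "S"],
}
labels = [0, 0, 1, 1]
ranking = rank_features(table, labels)
print(ranking)
```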

Study Area
Dak Lak Province (12°9'45" to 13°25'06" north latitude; 107°28'57" to 108°59'37" east longitude) is located in the central highlands of Vietnam (Figure 1). The province covers about 13,085 km², which accounts for 3.9% of the area of Vietnam. The topography of the study area is predominantly mountainous. The elevation in the area varies from 400 m to 2442 m (Chu Yang Sin Peak). A flat stretch of highland is in the middle part of the province, covering about 50% of the area. The climate of the province varies from hot below an elevation of 400 m, to humid at 400 m, to cold above 800 m. There are two main seasons: dry (November to April) and rainy (May to October), the latter bringing about 90% of the annual rainfall.
One-third of Dak Lak Province is covered by basalt rock and associated soil, which are favorable for developing rubber, coffee, and pepper plantations. About 660 million m³ of water is needed for 264,000 ha of coffee area, while the availability of surface water is only 250 million m³. During the dry months, the estimated amount of groundwater exploited for irrigation is about 500,000 m³/day. Because climate change and El Niño are altering the rainfall pattern in the central highland region, the socio-economic development of the province is becoming more dependent on groundwater resources. Therefore, the remaining water demand needs to be met by groundwater through proper groundwater management. Thus, groundwater potential mapping using advanced technology is required for groundwater assessment and for identifying areas suitable for recharge.

Well Yields
Groundwater potential assessment is generally done by the evaluation of well yield data [14,19,85,86]. In this study, the well yield data of 227 drilled wells were obtained from the national project of the Vietnam Academy for Water Resources (VAWR) and used for model development. The well data included: (1) water level under non-operating and operating conditions, (2) structure of the wells, (3) measurement of temperature and TDS (total dissolved solids), (4) yield, and (5) location. This dataset was randomly divided into separate sets such that 70% of the data was used for model building and the remaining data (30%) was used for model validation. The yield value of 1.6 l/s was selected as the threshold value for distinguishing between potential and non-potential areas of Dak Lak Province in terms of the groundwater potential.
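The labeling and splitting steps described above can be sketched as follows. The 1.6 l/s threshold, the 70/30 ratio, and the well count of 227 come from the text. The random seed, the rounding of the split size, and the choice to treat a yield exactly equal to the threshold as "potential" are assumptions made for illustration.

```python
import random

YIELD_THRESHOLD_L_PER_S = 1.6  # threshold stated in the text

def label_wells(yields_l_per_s):
    """1 = potential, 0 = non-potential (ties counted as potential: an assumption)."""
    return [1 if y >= YIELD_THRESHOLD_L_PER_S else 0 for y in yields_l_per_s]

def train_validation_split(n_samples, train_fraction=0.7, seed=42):
    """Randomly split sample indices into training and validation sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(train_fraction * n_samples)
    return idx[:cut], idx[cut:]

train_idx, validation_idx = train_validation_split(227)
print(len(train_idx), len(validation_idx))
print(label_wells([0.4, 1.6, 3.2]))
```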

Groundwater Influencing Factors
Selection of the groundwater influencing factors depends on the topography, geo-environment, meteorology, and anthropogenic activities of the area being investigated. In the present study, twelve influencing factors, namely soil type, geology, elevation, slope, Topographic Wetness Index (TWI), aspect, curvature, Sediment Transport Index (STI), river density, flow direction, land use, and rainfall were selected for groundwater potential mapping. A 30-m resolution Digital Elevation Model (DEM) collected from USGS (https://earthexplorer.usgs.gov/) was used to extract the maps of the slope, aspect, curvature, elevation, STI, TWI, flow direction, and river density factors.
The land use map was obtained from Dak Lak's Department of Natural Resources and Environment (DARD), Vietnam, at the scale of 1:50,000. Geology and rainfall maps were extracted from the hydrogeological map (1:300,000 scale) collected from the central region of the Vietnam Division for Water Resources Planning and Investigation, Vietnam.
An aspect map, which relates to the ability of the surface to accumulate and retain water, was prepared and classified into nine classes (Figure 2a). Runoff is greater on a convex surface, whereas water accumulates on a concave surface. The curvature in the study area ranged from −23.5 to 30.8 (Figure 2b). Runoff also depends on the meteorological conditions and elevation of the area. In the study area, the elevation ranged from 117 to 2424 m (Figure 2c). Flow direction is another important factor as it affects runoff and infiltration, and ranged in this area from 1 to 255 (Figure 2d). Slope has a direct relationship with the hydrological process, as it controls where surface water accumulates and thus where more infiltration leads to groundwater recharge (Figure 2e).
A land use map is also a very important influencing factor for the assessment of groundwater potential that reflects the local conditions and anthropogenic activities in the area. The land use map was classified into different classes (Figure 2f). Concrete construction in an area increases runoff, and thus reduces the recharge capacity of the ground. Similarly, deforestation increases runoff, and thus there is less infiltration into the soil.
TWI is important in assessing groundwater potential because this factor measures topography-moisture relationships (Figure 2h). River density is another important factor as it has an inverse relationship with infiltration and recharge. River density (7.565 km/km²) in the study area was relatively high, which indicates more runoff and thus less probability of recharge (Figure 2i).
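The text does not state which TWI formulation was used. The sketch below assumes the commonly used definition, TWI = ln(a / tan β), where a is the specific catchment area and β is the local slope angle. The minimum-slope guard and the input values are illustrative assumptions.

```python
import math

def topographic_wetness_index(specific_catchment_area, slope_degrees, min_slope_degrees=0.1):
    """TWI = ln(a / tan(beta)), with a small minimum slope to avoid division by zero."""
    beta = math.radians(max(slope_degrees, min_slope_degrees))
    return math.log(specific_catchment_area / math.tan(beta))

# Hypothetical cells: a gentle valley floor versus a steep hillslope
print(topographic_wetness_index(500.0, 2.0))
print(topographic_wetness_index(50.0, 30.0))
```

Under this definition, cells with a large contributing area and a gentle slope receive higher values, which is consistent with the topography-moisture relationship described above.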
Rainfall is one of the most important factors for groundwater potential mapping as it directly affects groundwater recharge [10,12,22]. The yearly average rainfall of this area varies from 4.80 to 7.23 mm (Figure 2j). Soil and geology characteristics are very important for infiltration, recharge, and formation of aquifers [14,86]. The geology map of the study area was prepared into 31 classes (Figure 2k). Permeable soil helps in more infiltration [19]. In the study area, the soil map was prepared based on the soil types available in the study area (Figure 2l).

Modeling Methodology
The methodology for the development and validation of the groundwater potential models and their outcomes (potential maps) is shown in Figure 3 and involves the following steps: (1) Collection and preparation of data. The data were collected from various government sources, meteorological departments, and satellite images. (4) Validation of models: various statistical measures were used for the validation of the groundwater potential models. These measures were AUC, NPV, SST, RMSE, ACC, SPF, PPV, and Kappa. Thereafter, the generation of groundwater potential maps was carried out using GIS software.
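Among the validation measures, AUC can be computed directly from the predicted scores and the observed classes. The sketch below uses the rank-based interpretation of AUC, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical model outputs for four validation wells
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(auc)
```

This pairwise form runs in quadratic time, which is acceptable for a few hundred wells. Larger datasets would normally use a sorting-based implementation.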

Factor Importance
The results of factor analysis using the OneR method measured the average merit (AM) for each influencing factor, which revealed that rainfall, land use, and elevation were the most important factors for the assessment of groundwater potential in Dak Lak Province (Table 1). Apart from these three factors, the other factors, which showed AM > 0, were also useful for the modeling process. Therefore, they were all selected for modeling of groundwater potential mapping in this study.

Groundwater potential maps
Groundwater potential maps were constructed using the training results of the models. The output values varying from 0 to 1 were classified into 5 classes (very low, low, moderate, high, very high) (Figure 6) using the natural breaks method, as it is the most suitable and popular technique for classification and construction of groundwater susceptibility maps [84]. Comparison of groundwater potential maps indicates that 51.56% of the study area is located in a very low groundwater potential zone and 11.62% in a very high zone using the BLR model. For the CGLR model, 55.53% of the study area is in a region with a very low groundwater potential and 11.95% is in a region with very high potential. In the case of the DLR model, 34.49% of the study area is located in a very low potential zone and 11.14% in a very high potential zone. For the RSSLR model, 26.02% of the study area lies in a very low potential zone and 11.11% in a very high zone. In the case of the LR model, 55.06% of the area is in a very low zone and 11.78% is in a very high zone. Depending on the model, between about 26% and 56% of the area of Dak Lak Province therefore falls in the very low groundwater potential zone, with the BLR, CGLR, and LR models placing more than half of the province in this class (Figure 7).
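Once class boundaries are known, the share of the study area in each class can be tallied as sketched below. The break values and pixel scores are hypothetical, since the natural breaks computed in this study are not reported in the text, and all pixels are assumed to cover equal areas.

```python
from bisect import bisect_right
from collections import Counter

CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]

def class_shares(values, breaks):
    """breaks: four ascending cut points separating the five classes."""
    counts = Counter(CLASS_NAMES[bisect_right(breaks, v)] for v in values)
    return {name: 100.0 * counts[name] / len(values) for name in CLASS_NAMES}

# Hypothetical pixel scores and break values
pixel_scores = [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.15]
shares = class_shares(pixel_scores, breaks=[0.2, 0.4, 0.6, 0.8])
print(shares)
```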

Discussion
Machine learning modeling of environmental problems has gained popularity because machine learning methods show promise when dealing with manifold geospatial data [69,87,88,89]. As such, machine learning modeling can effectively alleviate the difficulty associated with the identification of groundwater potential zones over large-scale regions, which often suffer a lack of accurate and long-term geotechnical and hydrogeological data for the implementation of physically based and/or numerical models [11]. However, the utility of different machine learning methods should be broadly investigated via their applications in different regions with different geo-environmental settings to find the best model with the highest accuracy and lowest sensitivity to noisy input data [33,70,87,90,91].
The results of our study revealed that all four ensemble techniques used in this study improved the performance of the base LR model and provided reliable estimations of groundwater potential based on different validation methods. More specifically, the results of the ROC-AUC method demonstrated that the single LR model and its four derived ensemble models, with a mean AUC of 0.88, had an excellent goodness-of-fit with the training dataset. During the validation phase, the performances of the models, which indicate their capabilities to estimate groundwater potential [37], decreased to a mean AUC of 0.73. Based on the relationship between AUC values and the predictive capability of the models that has been suggested in the literature [14,19,22,24,84], we can conclude that our models performed acceptably in estimating groundwater potential and developing distribution maps. Further, the results demonstrated the capability of ensemble learning techniques for improving LR performance. Among the four ensemble techniques used in this study, dagging was identified as the most effective technique, followed by the RSS, bagging, and CG techniques, suggesting that dagging was the most capable technique in reducing the variance, bias, and noise of the groundwater potential modeling.
Although the literature has mostly reported on the effectiveness of the ensemble techniques, these techniques showed different performances for different problems in different areas. For example, Nhu et al. [36] showed that reduced error pruning tree (REPT) performed better in combination with RSS than the bagging and AdaBoost techniques for gully erosion prediction, whereas Pham et al. [92] reported that the REPT model performed better with rotation forest and bagging than its combination with the RSS and multiboost for landslide prediction. Different results have also been reported for flood prediction based on the ensemble models [93,94]. From these studies, we can conclude that the machine learning and ensemble learning techniques are greatly case- and site-specific, and that their performances depend heavily on the local conditions that the training datasets are developed upon, indicating that the application of different methods in different regions should be continued to find the optimum method for each environmental setting [95].

Concluding Remarks
In Vietnam, there is a problem of water scarcity in Dak Lak Province due to enhanced requirements for agricultural development and the occurrence of frequent drought conditions in recent years. Erratic and decreased rainfall in the area has led to overexploitation of groundwater reservoirs to meet the day-to-day water requirements of the province for drinking, cultivation, and industrial uses. Therefore, there is a great need for the assessment of groundwater potential and identification of the suitable areas of recharge. To address this need, four ensemble models, namely BLR, DLR, CGLR, and RSSLR, were developed to produce groundwater potential maps for the province. All four models performed well (AUC > 0.7) for the assessment of groundwater potential and generation of potential zone maps. Among the studied models, the DLR model with AUC = 0.769 and RMSE = 0.444 was the best model in comparison to the other ensemble models and the single LR model. Therefore, a groundwater potential map generated using the DLR model can be used by decision-makers in the development of effective adaptive groundwater management plans for Dak Lak Province. The models proposed in the present study can be applied in other areas for better groundwater potential mapping considering local geo-environmental factors.

Figure 1. Location map of the study area (Dak Lak Province).

(2) Factor selection: The OneR feature selection method was used to validate and select the important factors for modeling use. (3) Construction of models: four ensemble models, namely DLR, BLR, CGLR, and RSSLR, were developed using the training dataset. More specifically, DLR is a combination of the LR model and the dagging ensemble technique, in which dagging was used to optimize the training datasets for the construction of the ensemble DLR model. BLR is a combination of the LR model and the bagging ensemble technique, in which bagging was used to optimize the training datasets for the construction of the ensemble BLR model. RSSLR is a combination of the LR model and the RSS ensemble technique, in which RSS was used to optimize the training datasets for the construction of the ensemble RSSLR model. CGLR is a combination of the LR model and the CG ensemble technique, in which CG was used to optimize the training datasets for the construction of the ensemble CGLR model.

Figure 3. The flow chart of the methodology used in this study.

Figure 4. Magnitude of modeling error in the training and validation datasets.

Figure 5. ROC curve of the models for the (a) training and (b) validation datasets.

Figure 7. Analysis of the groundwater potential maps.

Table 1. Factor ranks extracted using the OneR feature selection method.

Table 2. Model performance in the training and validation phases.