Improvement of Credal Decision Trees Using Ensemble Frameworks for Groundwater Potential Modeling

Abstract: Groundwater is one of the most important sources of fresh water all over the world, especially in countries where rainfall is erratic, such as Vietnam. Nowadays, machine learning (ML) models are being used to assess the groundwater potential of a region. Credal decision trees (CDT) is one of the ML models that has been used in such studies. In the present study, the performance of the CDT was improved using various ensemble frameworks: Bagging, Dagging, Decorate, Multiboost, and Random SubSpace. Based on these methods, five hybrid models, namely BCDT, Dagging-CDT, Decorate-CDT, MBCDT, and RSSCDT, were developed and applied for groundwater potential mapping of DakLak province of Vietnam. Data of 227 groundwater wells of the study area were utilized for the construction and validation of the models. Twelve groundwater potential conditioning factors, namely rainfall, slope, elevation, river density, Sediment Transport Index (STI), curvature, flow direction, aspect, soil, land use, Topographic Wetness Index (TWI), and geology, were considered for the model studies. Various statistical measures, including the area under the receiver operating characteristic curve (AUC), were applied to validate and compare the performance of the models. The results show that the performance of the hybrid CDT ensemble models MBCDT (AUC = 0.770), BCDT (AUC = 0.731), Dagging-CDT (AUC = 0.763), Decorate-CDT (AUC = 0.750), and RSSCDT (AUC = 0.766) improved significantly in comparison to the single CDT (AUC = 0.722) model. Therefore, these hybrid models can be applied for better groundwater potential mapping and groundwater resources management of the study area as well as other regions of the world.


Introduction
Groundwater is a vital natural resource for drinking water supply, irrigation and industries in many countries [1][2][3]. About 2.5 billion people all over the world depend on groundwater resources for drinking and agriculture [4]. Most of the world's groundwater resources are being overexploited, and thus acute water shortage is expected by 2025 all around the world as the fresh water resources are limited [5][6][7]. Population growth creates higher demand for water for domestic use, in addition to industrial development and extension of irrigated areas [8,9]. This problem is more prevalent in the arid and semi-arid regions, which have faced numerous drought events in recent years due to erratic and scanty rainfall [10,11]. Thus, the identification and mapping of groundwater potential zones is an important task for planning aquifer recharge. In recent years, several researchers, namely Magesh, Chandrasekar and Soundranayagam [1], Oikonomidis, Kazakis, Voudouris et al. [12], Rahmati, Samani, Mahdavi, et al. [13], and Zabihi, Pourghasemi, Pourtaghi, et al. [14], have studied groundwater potential, considering geological, hydrological and climatic factors using statistical methods, remote sensing, and geographic information system (GIS) technology [15]. Traditionally, expert opinion-based models or weighted models have been used for groundwater potential mapping. However, these approaches are considered subjective and uncertain [16,17].
Nowadays, with the advancement of spatial data acquisition and analysis, artificial intelligence (AI)-based machine learning (ML) models are being utilized for mapping of groundwater potential. ML models are based on computational algorithms that deal with complex problems and complex datasets [18]. Chen, Li, Tsangaratos, et al. [19] used ML models based on Random Forest (RF), Kernel Logistic Regression (KLR), and Alternating Decision Tree (ADT) for groundwater potential mapping in China. Naghibi, Pourghasemi and Dixon [20] applied and compared several ML models, namely Classification and Regression Tree (CART), Boosted Regression Tree (BRT) and RF for GIS-based groundwater potential mapping in Iran. Lee, Hong and Jung [21] used Artificial Neural Network (ANN) and Support Vector Machines (SVM) models to develop groundwater potential maps in Korea. Park, Hamm, Jeon, et al. [22] compared two ML-based models, Multivariate Adaptive Regression Splines and Logistic Regression (LR), for groundwater potential mapping in Korea. Ozdemir [23] applied LR for mapping of groundwater potential in Turkey. Other popular ML-based models used for groundwater potential mapping are Adaptive Network-based Fuzzy Inference System [24], Naïve Bayes [25], K-nearest neighbor and Quadratic Discriminant Analysis [26]. Although all these single ML models performed well in the studied regions, no model, including hybrid models [25], is available that can be applied to all regions for optimal groundwater potential mapping.
With the above objective, the present study was carried out to fill the gap of suitable and better models by improving the predictive capability of Credal Decision Trees (CDT), which is a popular machine learning method but quite sensitive to tree construction [27,28]. Different ensemble frameworks, namely Bagging, Dagging, Decorate, Multiboost, and Random SubSpace, were used with CDT as the base classifier to develop five hybrid models: BCDT, Dagging-CDT, Decorate-CDT, MBCDT, and RSSCDT. For the model studies, the DakLak province of Vietnam was selected, where groundwater resources need to be properly exploited because rainfall in this area is erratic due to the effects of climate change [29,30]. To validate the models, several statistical measures, including the area under the receiver operating characteristic (ROC) curve (AUC), were applied to the datasets. GIS and Weka software were used for data preparation, analysis and modeling.

Credal Decision Trees (CDT)
CDT is a classifier based on uncertainty measures and imprecise probabilities. It was first proposed in 2003 by Abellán and Moral to solve classification problems using credal sets [27]. To avoid producing overly complicated decision trees, the branching process is stopped when further ramification of the tree would increase the total uncertainty [31]. The total uncertainty of a credal set is measured quantitatively, based on the theory of Dempster and Shafer, as presented in the following equation:
$$TU(x) = IG(x) + GG(x)$$
where x is a credal set on frame X, TU is the total uncertainty value, IG is a general function of non-specificity on the corresponding credal set, and GG is a general randomness function for a credal set [32].

Bagging
Bagging is an ensemble technique that combines many ML classifiers to create a more accurate predictor. The name is a contraction of Bootstrap and Aggregating, the two steps that are combined to create a single overall model [33,34]. Bagging is sensitive to the data: small changes in the dataset can cause significant changes in the final results [35]. In this algorithm, the learning data for each learner are obtained by bootstrap sampling, and each trained learner contributes its prediction to the final ensemble [36]. Bagging produces better accuracy because the learners are trained more independently of one another.
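The bootstrap-and-aggregate idea can be illustrated with a minimal sketch. It is not the Weka implementation used in this study: the base learner below is a simple one-feature threshold rule standing in for a CDT, and the names `train_stump`, `bagging_fit` and `vote` are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Fit a one-feature threshold rule (assumed stand-in for a CDT base learner)."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            for sign in (1, -1):
                preds = [1 if sign * (r[j] - t) >= 0 else 0 for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    _, j, t, sign = best
    return lambda r: 1 if sign * (r[j] - t) >= 0 else 0

def bagging_fit(rows, labels, n_models=25, seed=0):
    """Train each base learner on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw n indices with replacement from the training set.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models

def vote(models, row):
    """Aggregate the individual predictions by majority vote."""
    return Counter(m(row) for m in models).most_common(1)[0][0]

# Toy data: the first column separates the two classes, the second is noise.
rows = [[0.1, 5], [0.2, 3], [0.3, 4], [0.7, 4], [0.8, 5], [0.9, 3]]
labels = [0, 0, 0, 1, 1, 1]
models = bagging_fit(rows, labels)
print(vote(models, [0.95, 4]), vote(models, [0.05, 4]))
```

Because every learner sees a different resampled dataset, an individual learner may fit a poor rule, but the majority vote smooths out such cases.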

Dagging
Dagging was first proposed by Ting and Witten in 1997. It uses disjoint samples instead of Bootstrap samples to train the base classifiers [37]. The name Dagging is derived from Bagging. In the Dagging algorithm, the samples are disjoint, so each instance of the dataset is used for training only once [38]. In this model, majority voting is used to combine the base classifiers and improve the accuracy of their predictions [39].
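A minimal sketch of the disjoint-sample idea follows. It assumes a nearest-class-mean rule as a stand-in for the CDT base classifier and stratifies the folds by class so that every fold contains both classes; both choices, and the names `disjoint_folds`, `train_centroid`, `dagging_fit` and `vote`, are illustrative assumptions rather than details taken from this study.

```python
import random
from collections import Counter

def disjoint_folds(labels, n_folds=3, seed=0):
    """Deal instance indices into disjoint folds, class by class (stratification is an assumption)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(n_folds)]
    k = 0
    for members in by_class.values():
        rng.shuffle(members)
        for i in members:
            folds[k % n_folds].append(i)  # each index lands in exactly one fold
            k += 1
    return folds

def train_centroid(rows, labels):
    """Nearest-class-mean rule, used only as a stand-in base learner."""
    sums, counts = {}, Counter(labels)
    for r, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(r))
        for j, v in enumerate(r):
            acc[j] += v
    means = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
    def predict(r):
        return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(r, means[y])))
    return predict

def dagging_fit(rows, labels, n_folds=3, seed=0):
    """Train one base learner per disjoint fold."""
    folds = disjoint_folds(labels, n_folds, seed)
    return [train_centroid([rows[i] for i in f], [labels[i] for i in f]) for f in folds]

def vote(models, row):
    """Combine the base learners by majority vote."""
    return Counter(m(row) for m in models).most_common(1)[0][0]
```

Unlike the bootstrap samples of Bagging, the folds here never share an instance, so each base learner is trained on a smaller but fully independent part of the data.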

Decorate
The Decorate algorithm was introduced by Melville and Mooney in 2003 [40] to improve training data by creating artificial data. The artificial data are generated from a Gaussian distribution whose mean and standard deviation are estimated from the training variables, and they are then added to the training samples. The difference between Decorate and other ensembles (Bagging and Adaboost) is that Adaboost and Bagging use only the given training variables to create the various classifiers [41], whereas Decorate builds the base classifiers using artificial data as well, so that the ensemble is no longer constrained by the given training samples.
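The data-generation step described above can be sketched as follows. This covers only the Gaussian sampling of artificial feature values; how the artificial rows are labeled and how ensemble members are accepted are not shown. The function name `artificial_rows` is hypothetical.

```python
import random
import statistics

def artificial_rows(rows, n_new, seed=0):
    """Sample artificial feature vectors from per-feature Gaussian distributions."""
    rng = random.Random(seed)
    columns = list(zip(*rows))
    # Mean and (population) standard deviation of every training variable.
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)] for _ in range(n_new)]

# Toy training data: two numeric conditioning factors.
training = [[0.2, 120.0], [0.4, 300.0], [0.6, 150.0], [0.8, 410.0]]
extra = artificial_rows(training, n_new=3)
augmented = training + extra  # artificial rows are appended to the training samples
print(len(augmented))
```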

MultiBoost
Multiboost was introduced by Webb in 2000 [42]. This technique combines the Adaboost and Wagging techniques to reduce variance and over-fitting [43]. The use of training instances with different weights in the Wagging model can reduce the high bias in the Adaboost model [44]. Combining Adaboost and Wagging is advantageous in the classification process because it transforms weak learners into a strong learner [42]. Multiboost is formed in three stages. In the first, a subset is randomly selected from the original data and used to form the models. In the second, weights are assigned, which reflect the changes in the model prediction. In the third, new subsets are chosen from the weighted instances to produce the new models [45].

Random SubSpace
Random SubSpace is considered to be one of the most popular random sampling methods. It was proposed by Ho in 1998 to improve the predictive capability of individual classifiers and the accuracy of weak classifiers [43,46]. In this technique, subspaces of low dimension are constructed by randomly sampling from the original high-dimensional feature vector, and the classifiers trained in these subspaces are then combined for the final decision [46]. The predictions of the sub-classifiers, each trained on its own feature subset, are combined into the final prediction using a majority vote [47].
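As an illustration of the subspace idea, the sketch below trains each ensemble member on a random subset of the feature columns and combines the members by majority vote. The nearest-class-mean base learner is an assumed stand-in for the CDT, and the names `train_centroid`, `random_subspace_fit` and `vote` are hypothetical.

```python
import random
from collections import Counter

def train_centroid(rows, labels):
    """Nearest-class-mean rule, used only as a stand-in base learner."""
    sums, counts = {}, Counter(labels)
    for r, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(r))
        for j, v in enumerate(r):
            acc[j] += v
    means = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}
    def predict(r):
        return min(means, key=lambda y: sum((a - b) ** 2 for a, b in zip(r, means[y])))
    return predict

def random_subspace_fit(rows, labels, n_models=5, subspace_size=2, seed=0):
    """Train each member on a randomly chosen subset of the feature columns."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    ensemble = []
    for _ in range(n_models):
        feats = sorted(rng.sample(range(n_features), subspace_size))
        projected = [[r[j] for j in feats] for r in rows]
        ensemble.append((feats, train_centroid(projected, labels)))
    return ensemble

def vote(ensemble, row):
    """Project the row into each member's subspace, then take the majority vote."""
    votes = Counter(model([row[j] for j in feats]) for feats, model in ensemble)
    return votes.most_common(1)[0][0]
```

All training instances are used by every member; only the set of features differs, which is what distinguishes this scheme from Bagging and Dagging.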

Correlation-based Feature Selection
Selection of the appropriate factors is a very important task when constructing the input variables and testing the ML models [20,48]. It helps to assess the contribution of each variable to the predicted outcome and to remove unnecessary factors from the input data. The quality of the data is thereby improved, because over-fitting and noise-related problems are reduced. This leads to an increase in the model's predictive capacity [49]. There are several methods to select variables, such as ORAE, Information Gain, and correlation-based feature selection [50]. Among them, correlation-based feature selection was selected in the present study. This method evaluates the attributes with respect to the target class. It measures the correlation between each input variable and the output variable, on the basis of which the importance of the input variables is evaluated and ranked [51].
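The ranking step can be sketched with a plain Pearson correlation between each factor and the target. This is a simplification chosen for illustration: the attribute evaluator actually used in Weka may compute its merit score differently, and the function names and toy values below are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column carries no information, so its score is set to 0.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def rank_factors(table, target):
    """Rank factors by the absolute correlation with the target, highest first."""
    scores = {name: abs(pearson(values, target)) for name, values in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: four wells, two factors, binary target (1 = potential groundwater).
factors = {"rainfall": [1, 2, 3, 4], "aspect": [1, -1, -1, 1]}
target = [0, 0, 1, 1]
for name, score in rank_factors(factors, target):
    print(name, round(score, 3))
```

A factor with a score close to zero contributes little on its own and is a candidate for removal, while high-scoring factors are retained as model inputs.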

Validation Methods
The performance of the models is evaluated by validation methods [52,53]. In addition, comparison of the training data and validation data plays an important role in determining the fit of the model [54]. The positive predictive value (PPV) and negative predictive value (NPV) present the percentage of pixels which are correctly predicted as "potential groundwater" and "non-potential groundwater", respectively [55,56]. Meanwhile, sensitivity (SST) and specificity (SPF) express the proportion of "potential groundwater" and "non-potential groundwater" pixels, respectively, which are correctly classified [57]. Accuracy (ACC) shows the proportion of "true positive" and "true negative" results in the test, that is, the rate of pixels correctly classified as "potential groundwater" or "non-potential groundwater" [58,59]. True Positive (TP) and False Positive (FP) refer to pixels which are correctly and incorrectly classified as "potential groundwater", respectively, while True Negative (TN) and False Negative (FN) refer to pixels which are correctly and incorrectly classified as "non-potential groundwater" [60]. These statistical measures can be calculated by the following equations:
$$PPV = \frac{TP}{TP + FP}$$
$$NPV = \frac{TN}{TN + FN}$$
$$SST = \frac{TP}{TP + FN}$$
$$SPF = \frac{TN}{TN + FP}$$
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
The Kappa (k) index is considered to be one of the most popular statistical measures for evaluating ML models. Kappa presents the degree of agreement between evaluators, corrected for the agreement expected by chance, when N objects are classified into C mutually exclusive sets. The value of kappa ranges between −1 and 1. If kappa equals 1, the model has perfect performance [61][62][63]. Kappa (k) can be calculated by the following equation:
$$k = \frac{P_{p} - P_{exp}}{1 - P_{exp}}$$
where P_p is the observed agreement (accuracy) and P_exp is the expected agreement.
The root mean square error (RMSE) is a statistical index that assesses the difference between the predicted values and the target values [64][65][66]. RMSE is a good metric for comparing the performance of the models, and is calculated as follows:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{predicted,i} - X_{actual,i}\right)^{2}}$$
where n is the total number of samples, and X_predicted,i and X_actual,i are the predicted and actual values of the i-th sample.
The ROC curve is a graph commonly used in the validation of binary classification models. The curve is created by plotting sensitivity against the false alarm rate (1 − specificity) [67,68]. The ROC curve therefore shows the trade-off between sensitivity and false alarm rate, which is important when choosing an appropriate model. The area under the ROC curve, called AUC, is often utilized to quantitatively validate and compare the predictive capability of the models, and is calculated as follows:
$$AUC = \frac{\sum TP + \sum TN}{P + N}$$
where P and N are the total numbers of "potential groundwater" and "non-potential groundwater" samples, respectively.
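The measures defined in this section can be computed from a confusion matrix with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the expected agreement for Kappa is derived from the row and column totals of the confusion matrix, and the AUC is computed with the standard rank-based (pairwise comparison) estimate, which is an assumption and may differ from the procedure used in Weka.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """PPV, NPV, SST, SPF, ACC and Kappa from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    # Expected chance agreement from the marginal totals of the confusion matrix.
    p_exp = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "ACC": acc,
        "Kappa": (acc - p_exp) / (1 - p_exp),
    }

def rmse(predicted, actual):
    """Root mean square error between predicted and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def auc(scores, labels):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts for 100 validation pixels.
print(confusion_metrics(tp=40, fp=10, tn=30, fn=20))
```

Reporting several measures side by side is useful because a model can score well on one (for example SPF) while scoring poorly on another (for example SST).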

Study Area
The study area of DakLak province is located between 107°28'57" and 108°59'37" East longitude, and between 12°9'45" and 13°… latitude. In general, the climate of the DakLak province varies with the topography. The area below 300 m elevation is hot, that between 400 and 800 m elevation is hot and humid, and that above 800 m is cold. In this region, about 90% of the annual rainfall occurs during the rainy season (May to October), and rainfall is almost negligible during summer (November to April).
Groundwater resources in the DakLak province are widely used for all needs, especially for irrigation. According to the Vietnam Academy for Water Resources (2018), in the dry season, water is needed for a total of 264,000 hectares. For coffee cultivation, the total water requirement is about 660 million m³ against an availability of surface water of 250 million m³. Therefore, the remaining water requirement has to be met by groundwater for coffee production as well as for other crop cultivation to avoid drought conditions. Currently, the amount of water exploited in the dry months in this province is estimated to be about 500,000 m³/day for irrigation, which is mainly concentrated in the Basalt Complex. About one third of the study area is covered by basalt rock and the remainder by Quaternary sediments, Pliocene formations and Proterozoic metamorphic rocks.


Well Yields
Well yield data of 227 wells of the DakLak province obtained from the Vietnam Academy for Water Resources (VAWR) were used in the present study (VAWR 2018). The data were split into two parts: 70% of the data were used to train the models, and the remaining 30% were used to validate them. Based on the local conditions and requirements, a well yield of 1.6 l/s was used as the threshold value for the model study [69].
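The labeling and splitting of the well data can be sketched as below. Two points are assumptions made only for illustration: wells whose yield is at or above the 1.6 l/s threshold are treated as the positive ("potential groundwater") class, and the 70/30 split is drawn at random. The function names are hypothetical.

```python
import random

YIELD_THRESHOLD_LPS = 1.6  # threshold yield in l/s, as stated in the text

def label_wells(yields_lps, threshold=YIELD_THRESHOLD_LPS):
    """Label a well 1 if its yield reaches the threshold, else 0 (direction is an assumption)."""
    return [1 if y >= threshold else 0 for y in yields_lps]

def split_70_30(n_items, train_fraction=0.7, seed=0):
    """Return disjoint lists of training and validation indices (random split is an assumption)."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    cut = round(n_items * train_fraction)
    return idx[:cut], idx[cut:]

train_idx, valid_idx = split_70_30(227)
print(len(train_idx), len(valid_idx))
```

With 227 wells, this rounding yields 159 training wells and 68 validation wells.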

Groundwater Influencing Parameters
In the groundwater model study, the groundwater influencing parameters or conditioning factors based on topography, hydrology, geo-environmental conditions and anthropogenic activities play an important role in the model's predictive capacity [70]. In the present study, 12 groundwater affecting factors, namely aspect, curvature, elevation, slope, Sediment Transport Index (STI), flow direction, rainfall, river density, soil type, Topographic Wetness Index (TWI), land use, and geology (lithology), were selected for modeling. Topography and hydrology factors were extracted from the ASTER Digital Elevation Model (DEM) of 30 m resolution from the United States Geological Survey (USGS) website (https://earthexplorer.usgs.gov/) using a GIS application and SAGA software [71]. The land use map (scale 1:50,000) and soil map (scale 1:100,000) were obtained from the DakLak Department of Natural Resources and Environment (DARD). Geology and rainfall maps were extracted from the hydrogeological map (1:300,000 scale) of South Central and Central Highland Vietnam prepared by the Central region of Vietnam Division for Water Resources Planning and Investigation (CEVIWRPI).
The aspect map shows the direction of the slope [72][73][74]. In this study, the aspect map is divided into nine classes (Figure 2a). The curvature map indicates the ability of the surface to accumulate and retain water. Normally, a concave slope accumulates more water [48,75,76]. In this region, curvature ranges from 23.5 to 30.8 (Figure 2b). Elevation is considered one of the most important factors in the groundwater potential model, as it is inversely related to groundwater potential [77]. In the study region, elevation ranges from 117 to 2424 m (Figure 2c). Slope has a direct relationship with the hydrological process. On flat ground, more surface water accumulates and thus more infiltration is likely, which helps groundwater recharge [76]. Slope in the DakLak province, which ranges between 0 and 69.9 degrees, is grouped into different classes based on the natural break method (Figure 2d).

STI helps in assessing erosion and deposition [78,79]. In this region, it varies from 0 to 25,019 (Figure 2e). TWI reflects the relationship between topography and the conditions of groundwater occurrence [80]. In this area, the value of TWI ranges from 6.04 to 20.433 (Figure 2f). Flow direction indicates the direction of runoff from higher to lower regions, thus affecting infiltration [81,82]. In this area, the flow direction value ranges from 1 to 255 (Figure 2g). Rainfall is considered an important factor for groundwater potential mapping because the chances of infiltration are greater in cases of high precipitation, thus leading to more recharge [83,84]. The average yearly rainfall value in the study area ranges from 4.80 to 7.23 mm (Figure 2h). River density is inversely related to infiltration [48,[83][84][85]. The study area has a high river density (7.565 km/km²) and thus a lower probability of recharge (Figure 2i).
Soil is also an important factor in the modeling of groundwater potential. Permeability of the soil depends on its texture and structure which reflects the infiltration capacity of the soil [86][87][88]. The soil map of the study area is grouped into different classes based on local variations of soil properties (Figure 2j and Table 1). Land use depends on the topography, nature of the soil, hydrology, meteorology and human (anthropogenic) requirement. Anthropogenic activities generally change the land use pattern, thus affecting groundwater potential locally [48,89]. In this study, the land use map was classified into various classes (G1 to G18) (Figure 2k and Table 2). Geology plays an important role in groundwater occurrence and thus in modeling of groundwater potential. Geological structure affects surface water infiltration (recharge) and groundwater movement. The porosity and permeability of rocks are important for assessing the characteristics of the ground surface and aquifer [90,91]. The geology map of the region was classified into different types of formation based on the characteristics of rocks (Figure 2l).

Methodological Flow Chart
The methodology of the present groundwater potential model study is divided into four main stages: (1) GIS data collection and preparation, (2) correlation-based feature selection and generation of datasets, (3) hybrid model construction, and (4) performance assessment and final trained hybrid models (Figure 3). More specifically, (1) the groundwater inventory map and conditioning factor maps were prepared and analyzed to develop the groundwater potential map. As the original data of these maps were on different scales (units), they were normalized to values from 0 to 1 for use as model input data [92]; (2) correlation-based feature selection was used to validate and select the suitable conditioning input factors for groundwater potential assessment, and the inventory data were then split into two parts: 70% of the data (training data) were used to build the models, and the other 30% (testing data) were used to validate them; (3) the single CDT model and the hybrid ensemble models, namely BCDT, Dagging-CDT, Decorate-CDT, MBCDT, and RSSCDT, were constructed using the training datasets. A list of the model parameters utilized for training the models is presented in Table 3; (4) the groundwater potential models were validated using various statistical measures: SST, SPF, ACC, K, PPV, NPV, RMSE and AUC. After the validation of the models, groundwater potential maps were constructed using the studied models. These maps were classified into five classes (very high, high, moderate, low and very low) based on the natural break classification method [93] in the GIS application.
Table 3. List of the parameters used in different models.

Figure 3. Methodological flow chart of this study.
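The normalization of the conditioning factors to the range 0 to 1, mentioned in stage (1), is commonly done with min-max scaling. The sketch below assumes that form of normalization; the study does not state the exact formula, and the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant factor has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example with the elevation range reported for the study area (117 to 2424 m).
print(min_max_normalize([117, 1270.5, 2424]))
```

Scaling every factor to a common range prevents factors with large numeric values, such as STI, from dominating factors with small values, such as slope.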

Analysis of Feature Selection of Groundwater Potential Influencing Factors
Groundwater potential influencing factors are selected based on field knowledge of the area, including geology, topography, geomorphology, meteorology, land use pattern and anthropogenic activities [48,94,95]. At present, there is no known best method for selecting the appropriate influencing factors for groundwater potential assessment that applies universally to all areas [54,96,97]. However, the correlation-based feature selection method is currently considered to be one of the most popular methods for this task due to its ability to take into account the impact of each variable [49]. Therefore, in this study, this method was applied to the 12 initially considered factors: land use, slope, elevation, river density, STI, curvature, TWI, flow direction, aspect, soil, geology, and rainfall. The results show that all these factors (variables) contributed to the groundwater potential model, but among these, land use and rainfall are the most important factors in the study area (Figure 4).


Evaluation of Models Performance Using Statistical Methods
Groundwater potential models were constructed using training data and validated using testing data [98,99]. Weka software was used for the modeling. For the training data, the results indicate that the MBCDT model is better in terms of PPV and SPF, whereas, in terms of NPV, the Dagging-CDT model is better in comparison to the other models. However, the RSSCDT model is more efficient than the other models in terms of SST, Kappa and ACC values (Figures 5 and 6). The results for the validation data suggest that the RSSCDT model is more efficient than the other models in terms of NPV, SST, ACC and Kappa values (Figures 5 and 6).

Analysis of the models' performance was also done using RMSE values. The results indicate that the BCDT model is the best in terms of training data (Figure 7), whereas the RSSCDT model is more efficient for the validation data in comparison to other models (Figure 8).
Figure 6. Performance of the models using Kappa criteria.
Figure 8. RMSE analysis of the models using testing data set.
Comparative analysis of the models' performance using AUC values indicated that, in terms of training data, the BCDT model performed best (AUC = 0.933), followed by the RSSCDT model (0.909), Decorate-CDT model (0.901), MBCDT (0.899), Dagging-CDT (0.856) and CDT (0.819) (Figure 9). In terms of validation data, the MBCDT model showed the best predictive performance (AUC = 0.770), followed by RSSCDT (0.766), Dagging-CDT (0.763), Decorate-CDT (0.750), BCDT (0.731), and CDT (0.722). In general, the results of the model study show that all the models have AUC > 0.7; thus, they are all efficient in building the groundwater potential maps.

Evaluation and Validation of Groundwater Potential Maps
In the present study, groundwater potential maps were developed using six models: CDT, BCDT, Dagging-CDT, Decorate-CDT, MBCDT, and RSSCDT. These maps were classified into five groups of groundwater potential zones: very low, low, moderate, high and very high (Figure 10). Analysis of the groundwater potential maps suggested that, for the CDT model, about 80% of the area is located in very low, 5% in low, and 15% in very high potential zones. For the BCDT model, 50% of the area is in very low, 20% in low, 10% in moderate, 7% in high and 13% in very high potential zones. For the Dagging-CDT model, about 35% is located in very low, 25% in low, 10% in moderate, 7% in high and 13% in very high zones. For the Decorate-CDT model, 35% of the area is located in very low, 26% in low, 20% in moderate, 8% in high and 11% in very high zones. For the MBCDT model, 70% of the study area is located in very low, 7% in low, 3% in moderate, 2% in high and 13% in very high zones. For the RSSCDT model, 10% of the area is in very low, 40% in low, 20% in moderate, 25% in high and 15% in very high zones (Figure 11). All the generated maps showed that high to very high groundwater potential areas are located in the central part of the study area. Thus, these groundwater potential maps can be used as scientific documents to assist decision-makers in land use planning and water resource management.


Discussion
Groundwater resources are an important source of potable water and are also used for agriculture and industry [100][101][102][103]. The mapping of groundwater potential is an essential task for better groundwater resource management of an area. Even though many studies have been carried out to map the groundwater potential in various regions of the world using different approaches [54,104], more efforts are needed to improve the quality of these maps for predicting groundwater potential zones accurately [16]. Nowadays, advanced ML techniques are being used for this purpose [25,105]. In this study, different ensemble ML techniques, namely Bagging, Dagging, MultiBoost, Random SubSpace, and Decorate, were used to improve the performance of a single ML model, namely CDT, and to develop various hybrid models (BCDT, Dagging-CDT, Decorate-CDT, MBCDT and RSSCDT) for improved groundwater potential mapping in the DakLak province, Vietnam.

Based on the results of model validation, it can be stated that the proposed ensemble frameworks improved the performance of the single CDT base classifier for groundwater potential mapping. This may be because the CDT algorithm produces quite different trees when the sub-datasets formed from a given problem domain differ [106,107]. This feature is necessary for building suitable classifiers and increases the classification capacity of the Random SubSpace, Bagging, and MultiBoost models [106][107][108]. Bagging is considered an important algorithm for improving on the accuracy of an individual classifier by combining different classifiers. In the present study, Bagging used the Radial Basis Function (RBF) kernel function to improve the stability of the CDT model. In addition, the Bagging algorithm used the bootstrap sampling method to decrease the sensitivity of an individual classifier to noise in the training data [33]. In the Bagging model, the generation errors of the base classifier are moved to the generation errors calculated on the smaller training data, and this model is useful for low classification [62,109]. The Dagging method has the advantage of reducing noise. Although Decorate is not as well known as the Bagging or MultiBoost algorithms, it is an efficient algorithm because it enhances the original training data by creating artificial data and then producing various classifiers on the artificial samples. Therefore, this algorithm has an advantage for small training datasets [41]. The literature survey indicated that MultiBoost can reduce the average error in terms of bias. In this method, the original training dataset is divided into several sub-datasets, which can be processed at the same time [42,43,57]. The findings of this study are also in line with other studies [106,107].
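The ensemble frameworks differ mainly in how they form the sub-datasets given to the base classifier. The sketch below illustrates three of these schemes in their general textbook form: bootstrap sampling (Bagging), disjoint folds (Dagging), and random feature subsets (Random SubSpace), followed by majority voting. It is not the implementation used in this study; all function names are hypothetical, the rows are placeholder indices rather than well records, and the stratification normally applied in Dagging is omitted for brevity.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def disjoint_folds(rows, k, rng):
    """Dagging: shuffle the rows, then split them into k disjoint folds."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def random_subspace(n_features, fraction, rng):
    """Random SubSpace: choose a random subset of feature indices."""
    size = max(1, round(fraction * n_features))
    return sorted(rng.sample(range(n_features), size))


def majority_vote(predictions):
    """Combine the class labels predicted by the ensemble members."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)      # fixed seed so the sketch is repeatable
rows = list(range(227))      # placeholder row indices, one per well record
bag = bootstrap_sample(rows, rng)
folds = disjoint_folds(rows, 5, rng)
subspace = random_subspace(12, 0.5, rng)  # 12 conditioning factors, half kept
vote = majority_vote([1, 0, 1])           # three hypothetical member votes
```

In each case one base tree would be trained per sub-dataset, and the ensemble prediction is the combined vote. The diversity among the sub-datasets is what allows the trees to differ from one another.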
In the present study, various validation criteria, namely SST, SPF, ACC, Kappa (K), PPV, NPV, RMSE, and AUC, were selected and used for validation and comparison of the models. The relative performance of the models differs from one statistical criterion to another. For example, RSSCDT is better than the other models in terms of NPV, SST, ACC, and Kappa (Figures 5 and 6), but MBCDT is better than the other models in terms of AUC (Figure 9). Thus, it can be stated that the ensemble frameworks improved the performance of the single CDT base classifier, but it is very difficult to determine from the applied validation criteria which ensemble method is the best.
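The validation criteria listed above can be computed from a binary confusion matrix and the predicted scores. The following is a minimal sketch using the standard definitions, reading SST as sensitivity and SPF as specificity (an assumption about the abbreviations); the function names and all input numbers are hypothetical and are not results of this study.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Standard statistics from a 2x2 confusion matrix."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    # Agreement expected by chance, needed for Cohen's kappa.
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "SST": tp / (tp + fn),    # sensitivity
        "SPF": tn / (tn + fp),    # specificity
        "ACC": acc,
        "PPV": tp / (tp + fp),    # positive predictive value
        "NPV": tn / (tn + fn),    # negative predictive value
        "Kappa": (acc - p_exp) / (1 - p_exp),
    }


def rmse(y_true, y_score):
    """Root mean square error between 0/1 labels and predicted scores."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(y_true, y_score)) / len(y_true))


def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical inputs for illustration only.
m = confusion_metrics(tp=40, tn=30, fp=10, fn=20)
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
error = rmse(labels, scores)
area = auc(labels, scores)
```

Because each criterion summarizes a different aspect of the confusion matrix or of the score ranking, two models can rank differently under different criteria, as observed for RSSCDT and MBCDT.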

Conclusions
In this study, various ensemble techniques, namely Bagging, Dagging, Decorate, MultiBoost, and Random SubSpace, were used to improve the performance of a single CDT base classifier for the generation of accurate groundwater potential maps. The performance of the five developed hybrid models, namely BCDT, Dagging-CDT, Decorate-CDT, MBCDT, and RSSCDT, was evaluated and compared with that of the single CDT model.
Validation results show that although all the models are efficient in groundwater potential mapping in the study area (AUC > 0.70), the performance of the ensemble models MBCDT (AUC = 0.770), BCDT (AUC = 0.731), Dagging-CDT (AUC = 0.763), Decorate-CDT (AUC = 0.750), and RSSCDT (AUC = 0.766) improved significantly in comparison to the single CDT model (AUC = 0.722). Thus, these developed hybrid models can be applied for better groundwater resources management of the study area as well as other regions of the world.
Groundwater potential zones identified through mapping with the developed hybrid (ensemble) models would help managers in prioritizing areas for the future development and systematic exploitation of groundwater resources, taking into account the annual needs and recharge of the area so that the water balance is maintained. All stakeholders, including government and non-government agencies and individuals, can use these maps for the sustainable development of the area. Based on these maps, local inhabitants in drought-affected areas can also be provided with technical help and monetary support for the construction and maintenance of recharge structures at suitable locations.
The results of this study would be helpful not only for the proper management of groundwater in the DakLak province of Vietnam but also for groundwater potential mapping and assessment in other drought-prone areas of the world.