Groundwater Spring Potential Mapping Using Artificial Intelligence Approach Based on Kernel Logistic Regression, Random Forest, and Alternating Decision Tree Models

Abstract: This study presents a methodology for constructing groundwater spring potential maps using kernel logistic regression (KLR), random forest (RF), and alternating decision tree (ADTree) models. The analysis was based on data concerning groundwater springs and fourteen explanatory factors (elevation, slope, aspect, plan curvature, profile curvature, stream power index, sediment transport index, topographic wetness index, distance to streams, distance to roads, normalized difference vegetation index (NDVI), lithology, soil, and land use), which were divided into training and validation datasets. The Ningtiaota region in the northern territory of Shaanxi Province, China, was considered as a test site. The frequency ratio method was applied to assign a coefficient weight to each class of each factor, whereas the linear support vector machine method was used as a feature selection method to determine the optimal set of factors. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to evaluate the performance of each model. On the training dataset, the RF model provided the highest AUC value (0.909), followed by the KLR (0.877) and ADTree (0.812) models. The same performance pattern was observed on the validation dataset, with the RF model providing the highest AUC value (0.811), followed by the KLR (0.797) and ADTree (0.773) models. This study highlights that the artificial intelligence approach could be considered a valid and accurate approach for groundwater spring potential zoning.


Introduction
As pointed out by many researchers, groundwater is one of the most important natural resources worldwide, with one third of the world's population depending on it [1][2][3][4]. Several areas in the world are subject to overexploitation of groundwater and undergo water shortages as a result of a gap between water supply and demand [5]. It is also well established that the demand for groundwater

Study Area
The Ningtiaota region is located in the northern territory of Shaanxi Province, China. The climate is characterized as dry throughout the year. The maximum and minimum temperatures are 38.9 °C and −29.0 °C, respectively, the average relative humidity is 56%, the average wind speed is 13.4 m/s, and the average annual rainfall is 434.1 mm. The study area is a portion of the Ningtiaota region, limited to the area where data were available. It covers 119.77 km² and lies within latitudes 38°57′30′′ to 39°7′57′′ N and longitudes 110°9′36′′ to 110°16′20′′ E (Figure 1). According to the Soil Map produced by the Institute of Soil Science, Chinese Academy of Sciences [45], the typical soil types that cover the study region are Calcari-Gypsiric Arenosols (Arc), Haplic Arenosols (ARh), Calcareous Red Clay (CMe), and Luvi-Calcic Kastanozems (KSk). Topographically, altitudes vary from 1118 to 1364 m above sea level, and slope gradients vary from 0 to 37.88° based on a digital elevation model (DEM) with a 30 m regular grid. Approximately 75.38% of the area has slopes of less than 10°, whereas only 0.097% of the total study area has slopes greater than 30°. Areas with slopes of 10-20° and 20-30° account for 21.77% and 2.74%, respectively.

Methodology
The investigation approach followed in the present study was a four-step procedure: (i) data selection, generation of the spring inventory map, and selection of nonspring areas; (ii) application of the frequency ratio (FR) method and of the linear support vector machine (LSVM) as a feature selection method, so as to quantify the contribution of each explanatory factor and determine the optimal set of factors with high predictive power, followed by construction of the training and validation datasets; (iii) application of the kernel logistic regression (KLR), random forest (RF), and alternating decision tree (ADTree) models; and (iv) validation and comparison of the developed models. Figure 2 shows the flowchart of the methodology, and each method used in the study is briefly described in the following paragraphs.

Frequency Ratio (FR)
The FR model, a popular and efficient bivariate statistical technique, is mainly used to estimate the probabilistic relation between dependent and independent factors, and also the relation of multi-classified maps [35]. According to the FR model, the following formula (Equation (1)) was used to calculate the FR values for the classes of the groundwater spring conditioning factors:

$$FR_{ij} = \frac{Spr_{ij} / Spr_{T}}{Sur_{ij} / Sur_{T}} \quad (1)$$

where $FR_{ij}$ is the frequency ratio of the $i$th class of the $j$th factor, $Spr_{ij}$ is the number of spring pixels in the $i$th class area of the $j$th factor, $Spr_{T}$ is the total number of springs, $Sur_{ij}$ is the number of pixels in the $i$th class area of the $j$th factor, and $Sur_{T}$ is the total number of pixels.
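As an illustration of Equation (1), the following minimal Python sketch computes FR values for the classes of a single factor. The function name `frequency_ratio` and the class counts are hypothetical and are not taken from the study; every class is assumed to contain at least one pixel.

```python
def frequency_ratio(spring_counts, pixel_counts):
    """Return the FR value of each class of one factor.

    FR = (Spr_ij / Spr_T) / (Sur_ij / Sur_T), following Equation (1).
    spring_counts: class -> number of spring pixels in that class
    pixel_counts:  class -> number of pixels in that class (must be > 0)
    """
    spr_total = sum(spring_counts.values())
    sur_total = sum(pixel_counts.values())
    fr = {}
    for cls, pixels in pixel_counts.items():
        spring_share = spring_counts.get(cls, 0) / spr_total
        area_share = pixels / sur_total
        fr[cls] = spring_share / area_share
    return fr


# Hypothetical counts for three slope classes (illustrative only).
springs = {"0-10": 20, "10-20": 20, "20-30": 6}
pixels = {"0-10": 100_000, "10-20": 29_000, "20-30": 4_000}
fr_values = frequency_ratio(springs, pixels)
for cls, value in fr_values.items():
    print(f"slope {cls}: FR = {value:.3f}")
```

An FR value above 1 means that springs are over-represented in a class relative to its share of the area, whereas a value of 0 means that no spring falls in the class.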

Selection of Spring Explanatory Factors Using an SVM Classifier
The quality of groundwater spring potential mapping is influenced by the quality and quantity of the input data and also by the predictive models that are used [15]. It is well known that groundwater spring explanatory factors may have unequal predictive capability in groundwater spring potential modeling. Therefore, explanatory factors with negligible predictive capability should not be included in the analysis, as they may produce less accurate results. Feature selection methods, specifically the gain ratio [46] and information gain ratio [47], are mainly used for estimating predictive capability; however, in our case, the linear support vector machine (LSVM) method was used [48]. The contributions of the 14 groundwater spring explanatory factors were determined as follows (Equation (2)) [49,50]:

$$g(m) = \mathrm{sgn}(w^{T} m + n) \quad (2)$$

where $m = (m_1, m_2, m_3, \ldots, m_{14})$ is the input vector of explanatory factors, $w^{T}$ is the transposed weight vector, and $n$ is the offset of the hyperplane from the origin.
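The idea of ranking factors by their weight in a linear SVM can be sketched as follows. This is a simplified illustration and not the procedure used in the study: the text does not specify how the average merit values were computed, and the sketch below simply trains a linear SVM by subgradient descent on the hinge loss and compares the absolute weights. The data, the hyperparameters, and the function names `train_linear_svm` and `decide` are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w and n of sgn(w^T m + n) by full-batch subgradient descent
    on the L2-regularised hinge loss. Labels must be +1 or -1."""
    n_samples = len(X)
    w = [0.0] * len(X[0])
    offset = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_offset = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + offset)
            if margin < 1:  # sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n_samples
                grad_offset -= yi / n_samples
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        offset -= lr * grad_offset
    return w, offset


def decide(w, offset, m):
    """Decision function of Equation (2): sgn(w^T m + n)."""
    return 1 if sum(wj * mj for wj, mj in zip(w, m)) + offset >= 0 else -1


# Hypothetical data: factor 0 separates the classes, factor 1 is mostly noise.
X = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.0], [-2.0, 0.1], [-1.6, -0.1], [-1.9, 0.2]]
y = [1, 1, 1, -1, -1, -1]
w, offset = train_linear_svm(X, y)

# Rank factors by absolute weight (inputs are assumed to share a common scale).
ranking = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
print("weights:", w, "ranking:", ranking)
```

Comparing absolute weights is only meaningful when the factors are on comparable scales, which is one reason to recode each class with its FR value before training.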

Kernel Logistic Regression (KLR)
Kernel logistic regression is a powerful discriminative method, described as the kernel version of logistic regression, which maps the original input feature space into a high-dimensional feature space by using kernel functions [51]. The basic kernel function, in which $\phi$ is assumed to be unknown, is given by Equation (3):

$$K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j) \quad (3)$$

where the superscript $T$ denotes the inner product in the $Z$ space.
The training dataset has $n$ input samples $(x_i, Y_i)$, with $x_i$ belonging to $R^{n}$ and $Y_i$ belonging to $\{-1, 1\}$, where $x_i$ is the $i$th input vector and $Y_i$ is the target value. For $Y_i = 1$, $x_i$ is assigned to class 1, whereas for $Y_i = -1$, $x_i$ is assigned to class 2. Let $Z_i = \phi(x_i)$. The kernel-based method then solves the following optimization problem (Equation (4)):

$$\min_{w, b} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} g\left(-Y_i \left(w^{T} Z_i + b\right)\right) \quad (4)$$

where $C$ is a regularization parameter, the optimal value of which is estimated by using techniques such as cross-validation or a grid search. For the KLR function, $g$ is given by Equation (5):

$$g(\xi) = \log\left(1 + e^{\xi}\right) \quad (5)$$

The goal of KLR is to estimate a discrimination function that distinguishes the two classes, in our case spring from nonspring areas (Equation (6)):

$$p = \frac{1}{1 + \exp\left(-\left(\sum_{i=1}^{n} \alpha_i K(x_i, x_j) + b\right)\right)} \quad (6)$$

where $p$ is the logistic function with values ranging between 0 and 1, $K(x_i, x_j)$ is the kernel function, which satisfies Mercer's condition [52], $\alpha_i$ are the dual parameters, and $b$ is the intercept. Several kernel functions can be used, such as the linear, polynomial, and normalized polynomial kernels [53]. However, in our case, the radial basis function (RBF) kernel was used (Equation (7)):

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right) \quad (7)$$

where $\sigma$ is a tuning parameter.
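The KLR prediction with an RBF kernel can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the study: the kernel is parameterized as exp(−‖a − b‖² / (2σ²)), the dual parameters are fitted by plain gradient descent on a penalized logistic loss (with `lam` playing the role of 1/C), and the one-dimensional data, the hyperparameters, and all function names are hypothetical.

```python
import math


def rbf_kernel(a, b, sigma):
    """RBF kernel with tuning parameter sigma."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict_proba(x, X_train, alpha, b, sigma):
    """Logistic function of sum_i alpha_i * K(x_i, x) + b."""
    f = sum(a_i * rbf_kernel(x_i, x, sigma) for a_i, x_i in zip(alpha, X_train)) + b
    return 1.0 / (1.0 + math.exp(-f))


def fit_klr(X_train, y, sigma=0.5, lam=0.01, lr=0.1, iters=500):
    """Fit alpha and b by gradient descent on the logistic loss
    plus the penalty (lam / 2) * alpha^T K alpha. Labels are +1 or -1."""
    n = len(X_train)
    K = [[rbf_kernel(xi, xj, sigma) for xj in X_train] for xi in X_train]
    targets = [1.0 if yi == 1 else 0.0 for yi in y]
    alpha, b = [0.0] * n, 0.0
    for _ in range(iters):
        f = [sum(K[i][j] * alpha[j] for j in range(n)) + b for i in range(n)]
        resid = [1.0 / (1.0 + math.exp(-fi)) - ti for fi, ti in zip(f, targets)]
        # Gradient with respect to alpha is K (resid + lam * alpha).
        grad_alpha = [
            sum(K[i][j] * (resid[j] + lam * alpha[j]) for j in range(n))
            for i in range(n)
        ]
        alpha = [a - lr * g for a, g in zip(alpha, grad_alpha)]
        b -= lr * sum(resid)
    return alpha, b


# Hypothetical one-dimensional samples: two nonspring (-1) and two spring (+1).
X_train = [[0.0], [0.2], [1.0], [1.2]]
y_train = [-1, -1, 1, 1]
SIGMA = 0.5
alpha, b = fit_klr(X_train, y_train, sigma=SIGMA)
p_spring = predict_proba([1.1], X_train, alpha, b, SIGMA)
p_nonspring = predict_proba([0.1], X_train, alpha, b, SIGMA)
print(f"p(spring | x=1.1) = {p_spring:.3f}, p(spring | x=0.1) = {p_nonspring:.3f}")
```

In practice, σ and the regularization strength would be selected by cross-validation or a grid search, as noted above.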

Random Forest (RF)
RF is an ensemble method of binary decision trees that are trained separately, and it is appropriate for both classification and regression problems [54]. For classification problems, each decision tree is trained separately, and the final outcome is estimated by taking into account the results obtained by all decision trees [55].
RF models are able to generalize and minimize the risk of overfitting without having to undergo any pruning process. Training involves creating a number of different bootstrap samples from the original dataset. For each bootstrap sample, approximately one-third of the data are left out of the process to act as test cases. These test cases are used to estimate an unbiased test error, referred to as the out-of-bag error, which expresses the predictive ability of the RF model [56].
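The "about one-third" out-of-bag share follows from sampling with replacement: a given sample is missed by a bootstrap sample of size n with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The short simulation below illustrates this. It is a sketch only; the function name `oob_fraction`, the number of repetitions, and the seed are arbitrary, and the sample size of 92 corresponds to the 46 spring and 46 nonspring training locations.

```python
import random


def oob_fraction(n_samples, n_trees, seed=0):
    """Average share of samples left out of a bootstrap sample (out-of-bag)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Draw n_samples indices with replacement; the set keeps the distinct ones.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total += (n_samples - len(in_bag)) / n_samples
    return total / n_trees


simulated = oob_fraction(n_samples=92, n_trees=500)
theoretical = (1 - 1 / 92) ** 92
print(f"simulated OOB share: {simulated:.3f}, theoretical: {theoretical:.3f}")
```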

Alternating Decision Tree (ADTree)
ADTree is a combination of decision tree and boosting techniques that generates classification rules with fewer nodes, is easier to interpret, and provides a measure of confidence called the classification margin [57].
ADTrees are similar to the option trees first described by Buntine [58] and further developed by Kohavi [59]. Compared to a single decision tree, option trees achieve a significant improvement in classification error. The structure of an ADTree is similar to that of an option tree, but ADTrees also use a boosting technique and achieve better performance levels [60]. Each boosting iteration adds three nodes to the tree (one splitter node and two prediction nodes), so more boosting iterations produce larger and more accurate trees. Unlike classical decision trees, ADTrees classify a sample by following all paths for which the decision nodes are true and summing the prediction nodes that are traversed. In the case of unknown feature values, the ADTree algorithm considers only the reachable decision nodes. For this reason, the ADTree algorithm can be applied widely in classification.
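The scoring rule described above (sum every prediction node reached, skip splitters whose feature is unknown) can be sketched as follows. The tree, its thresholds, and its prediction values are hypothetical and are not taken from the study; the dictionary layout and the function name `adtree_score` are assumptions made for this illustration.

```python
def adtree_score(node, sample):
    """Sum the values of all prediction nodes reachable for `sample`.

    A prediction node is {"value": float, "splitters": [...]}.
    A splitter node is {"feature", "threshold", "yes": node, "no": node}.
    """
    score = node["value"]
    for splitter in node.get("splitters", []):
        feature_value = sample.get(splitter["feature"])
        if feature_value is None:
            continue  # unknown value: this splitter's subtree is not reachable
        branch = "yes" if feature_value < splitter["threshold"] else "no"
        score += adtree_score(splitter[branch], sample)
    return score


# Hypothetical tree with three splitters (that is, three boosting iterations).
tree = {
    "value": 0.1,
    "splitters": [
        {
            "feature": "elevation", "threshold": 1150,
            "yes": {
                "value": 0.6,
                "splitters": [
                    {"feature": "slope", "threshold": 15,
                     "yes": {"value": -0.2}, "no": {"value": 0.3}},
                ],
            },
            "no": {"value": -0.5},
        },
        {"feature": "dist_streams", "threshold": 200,
         "yes": {"value": 0.4}, "no": {"value": -0.3}},
    ],
}

full = adtree_score(tree, {"elevation": 1120, "slope": 18, "dist_streams": 150})
missing = adtree_score(tree, {"elevation": 1120, "dist_streams": 150})
far = adtree_score(tree, {"elevation": 1300, "slope": 5, "dist_streams": 500})
print(full, missing, far)
```

The sign of the score gives the predicted class (here, positive for spring), and its absolute value plays the role of the classification margin.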

Validation and Comparison of the Results Obtained by the Models
The success and predictive performance of the three models were validated using receiver operating characteristic (ROC) curves [61][62][63][64][65]. The estimated AUC values range between 0.50 and 1.00 and can be classified based on a quantitative-qualitative classification scheme as follows: 0.5-0.6 (poor), 0.6-0.7 (average), 0.7-0.8 (good), 0.8-0.9 (very good), and 0.9-1 (excellent) [66]. In addition to the AUC values, two evaluation statistics, namely the standard error (SE) and the confidence interval (CI) at 95%, were also estimated. The best model has the smallest standard error and the narrowest CI [67,68].
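The evaluation statistics can be sketched in Python as follows. The AUC is computed here as the share of spring/nonspring pairs that the model ranks correctly, which is equivalent to the area under the ROC curve. The text does not state which SE estimator was used, so the Hanley-McNeil approximation below is an assumption and is not expected to reproduce the values reported in the study. The scores are hypothetical, and treating the lower bound of each rating class as inclusive is also an assumption.

```python
import math


def auc_by_ranking(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC (Hanley-McNeil approximation; an assumption here)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def rate_auc(auc):
    """Qualitative rating of an AUC value, lower bounds inclusive."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "very good"
    if auc >= 0.7:
        return "good"
    if auc >= 0.6:
        return "average"
    return "poor"


# Hypothetical model scores for three spring and three nonspring locations.
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
se = hanley_mcneil_se(auc, n_pos=3, n_neg=3)
ci_low, ci_high = auc - 1.96 * se, min(1.0, auc + 1.96 * se)
print(f"AUC = {auc:.3f} ({rate_auc(auc)}), SE = {se:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```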

Data Used
A crucial aspect of the groundwater spring potential mapping process is the identification of spring locations. Based on extensive field surveys conducted during 2006-2017, 66 springs were detected in the study area (Figures 1 and 3a,b). An equal number of 66 nonspring locations were randomly selected from the spring-free space by applying the Create Random Points function found in the Data Management Tools of the ArcGIS platform [69]. The spring and nonspring locations were randomly divided into two subsets by using the Subset tool in the Geostatistical extension package of the ArcGIS platform [69]. The first subset consisted of 46 spring and 46 nonspring locations (70% of the total number of spring and nonspring locations) and was used for training, whereas the second subset consisted of the remaining 30% (20 spring and 20 nonspring locations) and was used for validation.
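The 70/30 partition described above can be mimicked outside a GIS with a stratified random split. The sketch below is a stand-in for the ArcGIS Subset tool, not a reproduction of it; the location identifiers, the seeds, and the function name `split_locations` are hypothetical.

```python
import random


def split_locations(locations, n_train, seed):
    """Shuffle a copy of `locations` and split it into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(locations)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 66 spring and 66 nonspring locations.
springs = [("spring", i) for i in range(66)]
nonsprings = [("nonspring", i) for i in range(66)]

# Split each class separately so both subsets stay balanced (46/46 and 20/20).
train_s, valid_s = split_locations(springs, n_train=46, seed=1)
train_n, valid_n = split_locations(nonsprings, n_train=46, seed=2)
training = train_s + train_n
validation = valid_s + valid_n
print(len(training), len(validation))
```

Splitting springs and nonsprings separately keeps the class ratio identical in both subsets, which matches the balanced counts reported above.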

Generally, several spring explanatory factors may influence spring occurrence; however, there are no guidelines for the selection of spring explanatory factors. Therefore, in the present study and based on the experience gained from previous studies, 14 spring explanatory factors, including slope aspect, slope angle, plan curvature, profile curvature, elevation, stream power index (SPI), sediment transport index (STI), topographic wetness index (TWI), distance to streams, distance to roads, normalized difference vegetation index (NDVI), lithology, soil, and land use, were selected and prepared for further analysis within a GIS environment [43,70,71]. Eight geomorphometric factors, including slope aspect, slope angle, plan curvature, profile curvature, elevation, SPI, STI, and TWI, were extracted from the ASTER GDEM version 2 product (http://www.jspacesystems.or.jp/ersdac/GDEM/E/index.html) with a resolution of 30 m.
These spring explanatory factors were reclassified into categories (Table 2) based on the outcomes of frequency analysis concerning spring occurrence and the characteristics of the study area. The distance-to-streams and distance-to-roads maps were produced using topographic maps at 1:10,000 scale. The NDVI was calculated using Landsat 8 OLI imagery (path/row 126-33) acquired on 4 November 2017 (available at http://www.gscloud.cn). A lithological map was extracted from the geological map at a scale of 1:10,000 and constructed with nine classes based on lithological similarities [43,72]. The soil types were extracted from soil maps of the study area at 1:1,000,000 scale and were classified into four classes [43,73]. In addition, the land use map was extracted from land use maps at 1:100,000 scale with six land use types based on the supervised classification method and maximum likelihood algorithm [19]. All the spring explanatory factors were finally converted to the same spatial resolution of 30 × 30 m² (Figure 4).

Table 2 illustrates the average merit (AM) values of the 14 spring explanatory factors based on the LSVM algorithm classifier using a 10-fold cross-validation method [53]. The results of the LSVM analysis revealed that lithology had the highest predictive power (14.0), followed by elevation (12.8), SPI (12.2), and soil cover (10.6), making these the most significant factors contributing to the predictive performance of a model. Since all spring explanatory factors appear to have a positive predictive value, none of them were excluded from the analysis that followed.

Correlation Analysis between Springs and Explanatory Factors Using FR
The correlation between groundwater springs and explanatory factors using the FR values is illustrated in Table 3. Based on the results, springs are found more frequently on southeast-facing (1.989) and south-facing (1.338) slopes. Flat slopes, with no spring occurrence, have the lowest FR value (0.000). For slope angle, FR values increase with increasing slope angle and then decrease when slope angles are larger than 25° in the study area, and the class of 20-25 has the highest FR value (1.

Application of KLR, RF, and ADTree Models
Appl. Sci. 2020, 10, x FOR PEER REVIEW 15 of 23

Figure 5 illustrates the spring potential map constructed by the KLR method. Based on visual inspection of the produced spring potential map, the occurrence of springs appears to follow the spatial distribution of elevation and of the factor distance to streams. The high and very high potential groundwater spring zones cover mainly the central and north areas, whereas the south area exhibits low to very low values. The high spring potential class was estimated to cover 5.02% of the study area, whereas the low and very low spring potential classes cover 77.71% of the area (Table 4).

To enhance the performance of the RF method, a tuning process based on the grid search method was necessary [74]. The results of the tuning process indicated that the optimal parameters were 1500 trees for ntree and 11 for mtry. The implementation of RF also provided additional information concerning the importance of the spring explanatory factors for the overall spring potential mapping. This was achieved by calculating the mean decrease accuracy and the mean decrease Gini [75] (Figure 6). Higher values for both measures indicate that the factor is relatively more significant [76]. According to those two metrics, the most important factor was lithology, followed by elevation.

Figure 7 illustrates the groundwater spring potential map constructed by the RF model. Based on visual inspection of the spring potential map, it could be concluded that spring occurrence in this case follows the spatial pattern of the stream network, with high and very high potential zones covering mainly the central area, whereas the south area exhibits low to very low values. The high spring potential class covers 7.01% of the area, whereas the low and very low spring potential classes cover 68.47% of the area (Table 4).

Figure 8 shows the spring potential map constructed by the ADTree model based on the natural break method [77,78]. Compared to the previous methods, the ADTree method provides a rather different spatial distribution. The high spring potential class covers 9.15% of the area, whereas the low and very low spring potential classes cover 82.56% of the area. It seems that the ADTree method distinguishes the potential nonspring and spring areas more clearly than the other two methods.

Figure 8. Groundwater spring potential map by the ADTree model.

Validation and Comparison
Figure 9a,b illustrates the ROC plot assessment results based on the training and validation subsets. The AUC value for the success rate curve using the RF model was estimated to be 0.909, which corresponds to a prediction accuracy of 90.90%, followed by the KLR (0.877) and ADTree (0.812) models. RF showed the lowest SE value (0.0225), followed by KLR (0.0294) and ADTree (0.0341), and also the narrowest CI (0.088), followed by KLR (0.1) and ADTree (0.118) (Table 5). Similar performance patterns were observed when using the validation subset, with the AUC value for the predictive rate curve using the RF model estimated at 0.811, followed by the KLR (0.797) and ADTree (0.773) models. Again, the RF model showed the lowest SE (0.0526), followed by ADTree (0.0578) and KLR (0.0591). As for the CI values, RF showed the narrowest interval (0.183), followed by KLR (0.188) and ADTree (0.195) (Table 6). Based on the validation analysis, all three models appear to provide good accuracy, with the RF model producing slightly better results in terms of higher AUC values, lower SE, and narrower CI for both the training and validation subsets. Concerning the performance of KLR on the training subset, its AUC, SE, and CI values were relatively close to those of the RF model.

Discussion
As several studies report, the significance and predictive power of spring-related factors that are used in groundwater spring potential assessments are controlled by the geological, morphological, hydrological, and climatic settings of the area [19,22,[79][80][81]. According to Ozdemir [35], topographic features, such as elevation and slope, have a negative influence on groundwater spring potential, whereas TWI and drainage density have a positive influence. Similar studies report that topographic features, along with the characteristics of the soil cover, tectonic features (fault density and distance to faults), and hydrological features (drainage density), influence the rainfall-runoff rate and the infiltration rate, thus possibly affecting groundwater spring occurrence [19,35]. Chen et al. [43] reported that lithology, elevation, and distance to streams had a greater influence, whereas land use, NDVI, plan curvature, and profile curvature appear to have the least influence.
In the present study, the implementation of LSVM revealed that lithology had the highest predictive power, followed by elevation, SPI, and soil cover. Concerning the lithology factor, lithological and structural differences lead to variations in the durability and permeability of rock and soil formations, and thus in the presence of springs [35]. Based on the FR analysis, groundwater springs are more likely to be found on southeast-facing slopes, in areas with slope angles ranging between 15 and 25 degrees, and at elevations lower than 1150 m. Concerning slope angles, the outcomes of the study are consistent with previous studies, which report that areas with slopes greater than 35° are considered unfavorable, since runoff increases with slope and infiltration rates are reduced as a result [82,83]. Moreover, the most spring-probable areas are covered by Haplic Arenosols (ARh) soils, which are coarsely textured sandy soils that are permeable to water, and Calcic Kastanozems (KSk) soils, which are characterized by rather restricted water transmission and higher portions of clay particles. According to Srivastava and Bhattacharya [84], sandy soils and coarse sandy clays appear as potentially favorable storage bodies due to their light texture and excellent rate of infiltration, which is consistent with the findings of our study.
Within the research area, sand, mudstone, and sandstone formations appear to be more likely to contain springs. Similar findings were found by the authors in a previous study concerning the area of research [43]. Mudstone layers, which could be defined as formations with very low infiltration capacity, form an impermeable layer while sand and sandstone formations act as permeable layers allowing the concentration of surface water within their mass. The alternation of these layers permits the formation of groundwater springs as can be found in the area of research.
An interesting point that should be mentioned is the high predictive value of the factor distance to roads. The road network is considered to have an influence on the occurrence of groundwater springs, since its presence can cause local hydrological and erosion issues while indirectly affecting the groundwater table [85]. In addition, the presence of a road may influence the amount of soil moisture and the infiltration rate as a result of the removal of geological formations and the disturbance of the surface during the construction phase [43,85].
Concerning the validation and comparison of the three models (KLR, RF, and ADTree), the RF model appears to provide slightly higher AUC values, lower SE values, and narrower CIs than the other two methods. Several studies have indicated that RF models have higher accuracy compared to other models. According to Naghibi et al. [12], who applied support vector machine (SVM), RF, and genetic algorithm optimized RF (RFGA) methods to assess groundwater potential by spring locations, RF and optimized RF models outperformed SVM models. According to Golkarian et al. [86], this could be attributed to the methodological approach of RF, which involves aggregating the outcomes of many decision trees in order to limit overfitting effects as well as error due to bias and error due to variance, thus producing more accurate predictions. However, other studies report that the performance of RF models could be influenced by noisy data and by categorical variables with different numbers of levels, in which case RF models are biased in favor of the variables with more levels [36]. In the present study, KLR gave more accurate results than the ADTree model. In similar studies concerning landslide susceptibility assessments, which implemented KLR and ADTree methods, it was found that KLR produced more balanced results for the training and validation datasets in terms of the statistical index, while the ADTree models showed significant variance [74]. Finally, although the presented models appear to have satisfactory predictive performance, it must be kept in mind that their results are influenced by the quality and quantity of the available input data and also by the identification of nonspring areas.
Concerning future work, the presented approach could be applied to an area with different geo-environmental settings, or the analysis could include dynamic variables that may vary over short timeframes, such as precipitation and temperature, so as to further assess the efficiency of the proposed models.

Conclusions
In the present study, three artificial intelligence methods (KLR, RF, and ADTree) were used to generate a groundwater spring potential map for the Ningtiaota region, which is located in the northern territory of Shaanxi Province, China. A linear support vector machine method was used for feature selection so as to determine the optimal set of factors, which included fourteen explanatory factors (elevation, slope, aspect, plan curvature, profile curvature, stream power index, sediment transport index, topographic wetness index, distance to streams, distance to roads, NDVI, lithology, soil, and land use). The analysis highlighted the higher predictive power of the explanatory factors lithology, elevation, stream power index (SPI), and soil cover. These four factors significantly influence the prediction accuracy. The comparison between the performances of the KLR, RF, and ADTree models revealed that the RF model had higher prediction accuracy than the other two models, based on its higher AUC values, lower SE values, and narrower confidence intervals. The RF model's ability to limit overfitting effects may be the reason for its higher predictive performance. Although the results obtained by tree-based artificial intelligence approaches can be influenced by the quality and quantity of data, overall these approaches can be regarded as accurate and reliable investigation tools in groundwater spring potential assessments.
Author Contributions: W.C., Y.L., P.T., H.S., I.I., W.X., and H.B. contributed equally to the work. W.C., Y.L., W.X., and H.B. collected field data and conducted the analysis. W.C., Y.L., W.X., and H.B. wrote the manuscript. P.T., H.S., and I.I. edited the manuscript. All the authors discussed the results and revised the manuscript. The authors especially wish to thank Enke Hou for the useful information provided. All authors have read and agreed to the published version of the manuscript.