Evaluation of Logistic Regression and Multivariate Adaptive Regression Spline Models for Groundwater Potential Mapping Using R and GIS

This study mapped and analyzed groundwater potential using two different models, logistic regression (LR) and multivariate adaptive regression splines (MARS), and compared the results. A spatial database was constructed for groundwater well data and groundwater influence factors. Groundwater well data with a high potential yield of ≥70 m³/d were extracted, and 859 locations (70%) were used for model training, whereas the other 365 locations (30%) were used for model validation. We analyzed 16 groundwater influence factors including altitude, slope degree, slope aspect, plan curvature, profile curvature, topographic wetness index, stream power index, sediment transport index, distance from drainage, drainage density, lithology, distance from fault, fault density, distance from lineament, lineament density, and land cover. Groundwater potential maps (GPMs) were constructed using LR and MARS models and tested using receiver operating characteristic (ROC) curves. Based on this analysis, the area under the curve (AUC) for the success rate curve of GPMs created using the MARS and LR models was 0.867 and 0.838, and the AUC for the prediction rate curve was 0.836 and 0.801, respectively. This implies that the MARS model is useful and effective for groundwater potential analysis in the study area.


Introduction
Groundwater is defined as water in the saturated zone that fills the pore spaces between mineral grains and the cracks and fractures within a rock mass [1]. It results from the interactions of climatic, geological, hydrological, physiographical, and ecological factors [2]. Globally, groundwater makes up 50% of the present potable water supplies, 40% of the industrial water demand, and 20% of the water used for irrigation [3]. Therefore, it is not only an essential element of life, but also an essential natural resource. Due to the rapid population increase and economic development, the demand for groundwater resources for agricultural, industrial, and potable uses has been increasing [4]. Because groundwater is a limited resource, it is necessary to devise effective and efficient plans to use it based on an understanding of the behavior of groundwater systems and identification of the current status of the local groundwater system through groundwater exploration [5].
Traditional methods of exploring groundwater, which is a hidden natural resource, include drilling, geophysical, geological, and hydro-geological methods. However, such methods entail large expenses and the use of time and human resources for field surveys [6,7]. Groundwater potential maps (GPMs), based on geographic information system (GIS) and remote sensing (RS) data, have been widely used to solve this problem. GIS offers suitable alternatives for the effective management of large and complex geospatial databases [8]. In addition, it can be useful for groundwater exploitation and groundwater resource conversion, as it provides insights into the future availability of groundwater resources [9,10].
As shown in previous studies, diverse data mining techniques can be employed, but few such studies have been performed. Very few studies have attempted to create GPMs using the MARS model. The purpose of this study was to use the MARS model, a data mining technique, with the widely used LR model to create GPMs. The model performance of these GPMs was comparatively analyzed using receiver operating characteristic (ROC) curves. The ultimate aim was to evaluate the efficacy of the MARS model for creating GPMs.

Study Area
This study was conducted at a site in Buyeo-gun, Chungcheongnam-do, Korea, with a surface area of 625 km², located between 126°44′ and 127°03′ east longitude and 36°04′ and 36°23′ north latitude (Figure 1). The total population of Buyeo-gun was 71,143 in 2015, of which 31.2% or 22,213 individuals were engaged in farming [44]. The elevation of the study area is 0-640 m, and 72.8% of the overall area is lowland with an elevation of 100 m or less. The study area is a basin with a high temperature and large daily temperature range in the summer, as well as large amounts of dew and fog due to the influence of the Geumgang River. As of 2015, the annual precipitation was 848.8 mm, and more than half of the annual precipitation occurs in the summer. The annual mean temperature is 12.9 °C, with a maximum temperature of 35.8 °C in the summer and a minimum winter temperature of −14.2 °C [44]. In terms of ground cover, most of the study area is composed of agricultural (40.1%) and forest (47.0%) areas. The lithology indicates that 52.05% of the study area is covered with metamorphic rock. The study area contains one river and 51 streams. As of 2015, the water and sewer distribution rates in the study area were 73.6% and 50.6%, respectively. In 2015, Buyeo-gun used 35,899,226 m³ of groundwater, which is about 8% of the total groundwater use of Chungcheongnam-do (475,376,469 m³/year). Most of this water is used for farming (about 71%; [45]). Considering that other cities and districts in Chungcheongnam-do use only about 7% of the groundwater annually, Buyeo-gun has a relatively high dependence on groundwater.

Materials and Methods
In this study, a GPM was created through the three major steps described below. The first step was spatial database construction, in which a spatial database was created containing groundwater well locations and groundwater influence factors. The second step was groundwater potential assessment. LR and MARS models were used to analyze the relationships between well location and groundwater influence factors, and a GPM from each model was created for the overall study area. The third step was the validation process. The performance of the GPM created by each model was evaluated using ROC curves. A flow chart of the methodology used in this study is presented in Figure 2.

Well Data
The groundwater well data used in this study were collected from extensive field surveys and governmental reports. Well water was used for a variety of purposes including livestock, farming, and human drinking water. The groundwater yield was calculated from the results of a pumping test of a groundwater well. The groundwater potential was based on the prediction of the best potential for groundwater extraction in the study area [9]. Based on previous studies and groundwater productivity reports, an actual pumping test was conducted using the groundwater well data for the study area. The high productivity value was based on a yield value ≥70 m³/d. The groundwater productivity data from 1224 wells were selected and randomly divided into a training dataset containing 70% or 859 wells and a validation dataset with 30% or 365 wells. In addition, it was necessary to obtain sampling data from areas without groundwater wells. The same number of locations (859) without wells were selected as non-well occurrence data and allocated a value of 0 for application to the LR and MARS models. Figure 1 shows the locations of the groundwater well data used in this study.
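The random partition described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the integer well identifiers, the fixed seed, and the helper name `split_wells` are assumptions made for the example, and only the counts (1224 wells, 859 for training, 365 for validation, 859 non-well points) come from the text.

```python
import random

def split_wells(well_ids, n_train, seed=0):
    """Randomly partition well identifiers into training and validation lists."""
    rng = random.Random(seed)      # fixed seed (an assumption) for reproducibility
    shuffled = list(well_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers standing in for the 1224 high-yield wells.
wells = list(range(1224))
train, valid = split_wells(wells, n_train=859)

# Training labels: 1 for each well location, 0 for an equal number of
# non-well locations, as described for the LR and MARS inputs.
labels = [1] * len(train) + [0] * len(train)
```

The training size is passed explicitly because a strict 70% of 1224 is not a whole number; the text reports 859 wells for training.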

Groundwater Influencing Factors
As presented in Table 1 and Figure 3, 16 groundwater influence factors were used in this study. These factors were largely divided into topographical, hydrological, geological, and land cover factors. Groundwater influence factors were created using ArcGIS 10.2 software (ESRI, Redlands, CA, USA), and were converted into a raster file with a spatial resolution of 10 × 10 m prior to use for groundwater potential assessment.
Topographic factors included altitude, slope degree, slope aspect, plan curvature, and profile curvature. Areas with different elevations have notable differences in weather and climatic conditions, which lead to differences in soil conditions and vegetation [46]. The slope degree is a factor that is mainly used to determine groundwater recharge processes, as gentle slope areas have a low surface runoff and high rates of percolation, while the opposite is true for high slope areas [19]. The slope aspect is a factor related to precipitation direction and physiographic trends, and it affects the soil water content [12]. Curvature represents the morphology of the topography, and is composed of three aspects: profile, plan, and total, the latter of which combines profile and plan. Profile curvature and plan curvature mainly affect the acceleration and deceleration of flow, as well as flow convergence and divergence, on the ground surface [22]. These factors were extracted from the digital elevation model (DEM) using a spatial analyst tool of ArcGIS 10.2 software. The DEM was created using contour lines and points extracted from a 1:5000 digital map provided by the Korean National Geographic Information Institute. The ArcGIS triangular irregular network (TIN) module was used for this process, and the generated TIN was converted into a raster file with a pixel size of 10 m.
Hydrologic factors such as the topographic wetness index (TWI), stream power index (SPI), sediment transport index (STI), distance from drainage, and drainage density were considered when estimating the flow of surface water and groundwater according to topographical factors. TWI is a secondary topographic index that has been used to describe spatial moisture patterns and explain the effects of topographic conditions on these patterns [47]. It plays an important role in influencing the movement and accumulation of runoff on the ground surface [11]. SPI is a factor that estimates the degree of slope erosion due to flowing water. TWI and SPI are calculated using the following equations [47]:

$$\mathrm{TWI} = \ln\left(\frac{A_s}{\tan\beta}\right) \tag{1}$$

$$\mathrm{SPI} = A_s \tan\beta \tag{2}$$

where $A_s$ is the cumulative upslope area and $\beta$ is the slope gradient. STI combines slope steepness and slope length, and is used to measure the sediment transport capacity of overland flow within the universal soil loss equation [48]. STI is calculated using the following equation [49]:

$$\mathrm{STI} = \left(\frac{A_s}{22.13}\right)^{0.6}\left(\frac{\sin\beta}{0.0896}\right)^{1.3} \tag{3}$$

where $A_s$ is the cumulative upslope area and $\beta$ is the slope gradient. Drainage lines were used to create a drainage density map and distance map of the study area. Drainage density has an inverse relationship with permeability. A high drainage density decreases infiltration and increases surface runoff, and is therefore not appropriate for groundwater development [50]. The drainage density is calculated by dividing the sum of drainage lengths (km) by the surface area (km²) for the corresponding cell. The drainage density and distance from drainage were determined using the line density tool and Euclidean distance tool in ArcGIS 10.2 software, respectively.
Geological factors affect the porosity and permeability of aquifer materials, and are thus considered indicators of hydrological features. Geological factors are composed of lithology, distance from fault, fault density, distance from lineament, and lineament density. These factors were determined using a digital geological map at a 1:50,000 scale, obtained from the Korea Institute of Geoscience and Mineral Resources. The study area was divided into 37 lithology units according to the type of lithology and geological age. In this study, these lithology units were classified according to their characteristics into metamorphic rock, sedimentary rock A, sedimentary rock B, igneous rock, and dike and talus. Here, sedimentary rocks were classified based on permeability. Sedimentary rock A is permeable rock made of sandstone or gravel, whereas sedimentary rock B is impermeable rock made of shale or clay. Faults extracted from the digital geology map were used to determine the distance from the fault and fault density. A lineament is defined as a straight or slightly curved surface feature of natural origin directly observed from the image [51,52]. Because lineaments are related to discontinuities such as joints, faults, and folds, they have been used for structural analysis, lithological relationship analysis, and groundwater productivity assessment [13]. In this study, lineaments were extracted from a hill-shaded map created from the DEM using the geological and geophysical analysis tool in Geomatica 2016 software (PCI Geomatics, Markham, ON, Canada). The hill-shaded map was created by combining images in three directions where the sun altitude was 45° and the sun azimuth was 45°, 90°, or 135°. The extracted lineament lines were used to determine the distance from the lineament and lineament density using the Euclidean distance tool and line density tool in ArcGIS 10.2 software.
Land cover represents the biological state of a geographic feature on the surface of the earth, and has been used to demarcate groundwater availability [22]. The type of land cover contributes to variation in the soil condition, and discontinuities resulting from this variation affect the occurrence, storage, and movement of groundwater. In this study, land cover factors were constructed using a digital land cover map provided by the Ministry of Environment. The digital land cover map was prepared on a 1:25,000 scale using Korea Multi-Purpose SATellite-2, and this map included 22 land cover categories that were reclassified into seven groups including urban, farmland, forest, grassland, wetland, bare land, and water.
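For the hydrologic indices described in this section, the per-cell calculation can be sketched as below. This is an illustrative sketch under stated assumptions: it uses the commonly cited forms TWI = ln(A_s / tan β), SPI = A_s tan β, and STI = (A_s / 22.13)^0.6 (sin β / 0.0896)^1.3, with β in radians; the function names and the sample cell values are hypothetical and are not taken from the study's rasters.

```python
import math

def twi(a_s, beta):
    """Topographic wetness index for upslope area a_s and slope beta (radians)."""
    return math.log(a_s / math.tan(beta))

def spi(a_s, beta):
    """Stream power index for upslope area a_s and slope beta (radians)."""
    return a_s * math.tan(beta)

def sti(a_s, beta):
    """Sediment transport index for upslope area a_s and slope beta (radians)."""
    return (a_s / 22.13) ** 0.6 * (math.sin(beta) / 0.0896) ** 1.3

# Hypothetical raster cell: 500 units of upslope area on a 5-degree slope.
cell_area = 500.0
cell_slope = math.radians(5.0)
cell_indices = (twi(cell_area, cell_slope),
                spi(cell_area, cell_slope),
                sti(cell_area, cell_slope))
```

Note that TWI is undefined for perfectly flat cells (tan β = 0), so a practical workflow typically substitutes a small minimum slope before applying the formula; how the study handled such cells is not stated.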

Groundwater Potential Mapping
In this study, groundwater potential was assessed using the LR and MARS models. Analysis of the LR and MARS models was performed using the "glm" function and the "earth" package, respectively, in R 3.3.0 software (R Foundation for Statistical Computing, Vienna, Austria).

Logistic Regression
The LR model is a type of multivariate regression used to explain the relationships among a dichotomous dependent variable coded into 0 and 1, and one or more categorical or numerical independent variables [53]. In this study, the dependent variable was a binary variable indicating the presence or absence of groundwater wells, with a value of 1 or 0, respectively. Independent variables were the groundwater influencing factors that affect the groundwater wells. In general, the LR model can be expressed as follows [14,21]:

$$P = \frac{1}{1 + e^{-Z}} \tag{4}$$

where $P$ is the probability of an occurrence and $Z$ is the linear combination function of the independent variables showing a linear relationship. $Z$ can be expressed as follows:

$$Z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \tag{5}$$

where $\alpha$ is the intercept, $n$ is the number of independent variables, $\beta_i$ ($i = 1, \ldots, n$) are the regression coefficients that represent the contribution of each independent variable to the probability value ($P$), and $x_i$ are the independent variables. In Equation (5), the value of $Z$ ranges from −∞ to +∞.
Positive regression coefficients indicate that the dependent variable has a positive correlation with the independent variables, whereas negative regression coefficients indicate that the independent variables have a negative effect on the dependent variable.However, if the dependent variable is a binary variable divided into presence and absence, as in this study, the value of the dependent variable is coded as 1 and 0. Thus, the predicted value of the dependent variable is a probability estimate.
The probability value has an upper limit of 1 and a lower limit of 0, and the relationship between the dependent variable and this probability cannot be expressed as a linear function. Therefore, the upper and lower limits of the probability are removed by converting the probability into a logit function.
The relationship between the dependent variable and logit can be expressed as a linear function.
Probability can be converted into a logit function using the following equation:

$$\mathrm{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \tag{6}$$

where $P/(1-P)$ is the odds or likelihood ratio representing the ratio between the probability $P$ that the dependent variable is present and the probability $1 - P$ that the dependent variable is absent. The natural logarithm of odds is called logit($P$) and is a linear function of the independent variables ranging from −∞ to +∞. To be more precise, if the value of the probability $P$ increases, the value of logit($P$) also increases [14].
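The relationship between the linear predictor Z, the probability P, and the logit can be illustrated with a short sketch. The intercept, coefficients, and factor values below are hypothetical and chosen only for demonstration; they are not the fitted values of this study.

```python
import math

def linear_predictor(alpha, betas, xs):
    """Z = alpha + sum of beta_i * x_i over the independent variables."""
    return alpha + sum(b * x for b, x in zip(betas, xs))

def probability(z):
    """Logistic function mapping Z in (-inf, +inf) to P in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Natural logarithm of the odds P / (1 - P); the inverse of probability()."""
    return math.log(p / (1.0 - p))

# Hypothetical intercept, coefficients, and factor values for one raster cell.
alpha = -0.5
betas = [0.8, -1.2]
xs = [1.0, 0.25]
z = linear_predictor(alpha, betas, xs)
p = probability(z)
```

Because `logit` undoes `probability`, applying it to P recovers Z, which is why the logit can be modeled as a linear function of the independent variables even though P itself is bounded.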

Multivariate Adaptive Regression Splines Model
The MARS model is a statistical method introduced by Friedman (1991) that is used to fit the relationship between the dependent and independent variables. It is a nonlinear and nonparametric regression method that combines classic linear regression, the mathematical construction of splines, binary recursive partitioning, and brute and intelligent algorithms [54,55]. A benefit of this method is that specific assumptions of the underlying functional relationship between the independent and dependent variables are unnecessary [56,57]. The MARS model predicts a function using linear combinations and interactions of the adaptive piecewise linear regression known as the "basis function (BF)". Accordingly, $f(X)$ of the MARS model can be expressed by the following equation [42,57]:

$$f(X) = \beta_0 + \sum_{i=1}^{n} \beta_i \lambda_i(X) \tag{7}$$

where $\beta_0$ is a constant, $\beta_i$ is the coefficient of the $i$th BF, $\lambda_i(X)$ is a BF, and $n$ is the number of BFs in the model. All of the coefficients were estimated using the least-squares method. The BFs are functions that take the following form [58]:

$$\max(0,\, x - \alpha) \quad \text{or} \quad \max(0,\, \alpha - x) \tag{8}$$

where $x$ is an independent variable and $\alpha$ is a constant corresponding to a knot (or hinge). Two adjacent splines intersect at the knot to maintain the presence of the BF [33].
The MARS model was developed in two steps: the forward stepwise algorithm and the backward stepwise algorithm. The first step, the forward stepwise algorithm, adds BFs to Equation (7) and finds potential knots to obtain a better model performance. However, obtaining too many BFs in this process can result in overfitting the MARS model. The second step, the backward stepwise algorithm, is used to lessen this problem. In this step, redundant BFs that have the smallest contributions to the model are removed from the BFs used in the forward stepwise algorithm to find the best sub-model. Generalized cross-validation (GCV) is used to remove redundant BFs from the MARS model, and is calculated as follows [23,37]:

$$\mathrm{GCV} = \frac{\frac{1}{N}\sum_{i=1}^{N}\left[y_i - f(X_i)\right]^2}{\left[1 - \frac{M + d\,(M-1)/2}{N}\right]^2} \tag{9}$$

where $N$ is the amount of data, $y_i$ is the observed value, $f(X_i)$ is the predicted value of the MARS model, $M$ is the number of BFs, and $d$ is the penalizing parameter. If the value of $d$ is large, it can result in a small number of knots being used. The optimum value of $d$ is considered to be in the range of 2 ≤ $d$ ≤ 4 [56]. In this study, a default value of 3 was used for $d$.
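The hinge-type basis functions and the GCV criterion can be sketched as follows. This is a minimal illustration, not the "earth" implementation: the function names are hypothetical, and the effective number of parameters is assumed here to be M + d(M − 1)/2, which is one common convention.

```python
def hinge_pair(x, knot):
    """Mirrored hinge basis functions max(0, x - knot) and max(0, knot - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)

def gcv(observed, predicted, n_terms, d=3.0):
    """Generalized cross-validation score with penalizing parameter d.

    Assumes the effective number of parameters is
    n_terms + d * (n_terms - 1) / 2 (an assumption of this sketch).
    """
    n = len(observed)
    mean_sq_error = sum((y - f) ** 2 for y, f in zip(observed, predicted)) / n
    effective = n_terms + d * (n_terms - 1) / 2.0
    return mean_sq_error / (1.0 - effective / n) ** 2

# Hypothetical data: ten binary observations predicted by a constant 0.5.
observed = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
predicted = [0.5] * 10
score = gcv(observed, predicted, n_terms=1)
```

Each knot contributes a pair of hinges, one active on either side of the knot, so a single continuous variable can have different slopes in different ranges. Increasing `d` inflates the denominator penalty for each added term, which is why larger values lead to fewer retained knots.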

Preliminary Analysis
In this study, multicollinearity and frequency ratios (FRs) were analyzed as preliminary analyses prior to groundwater potential assessment.

Multicollinearity Analysis
Multicollinearity refers to a linear relationship that exists among two or more variables. If multicollinearity exists among the independent variables during regression analyses, the variance of the regression coefficient increases. Error also increases with multicollinearity, reducing the accuracy of the model's prediction. Multicollinearity can be assessed by various means, and tolerance (TOL) and variance inflation factor (VIF) assessments were used in this study. TOL and VIF indicate multicollinearity among independent variables when their values are ≤0.1 and ≥10, respectively [59]. The results of the multicollinearity analysis on the 16 independent variables used in this study are presented in Table 2. This analysis showed that the TOL and VIF of all variables used in this study were ≥0.1 and ≤10, respectively. This suggests that there was no problem of multicollinearity among the independent variables used in this study, so LR analyses were performed with all of the variables.
The frequency ratio (FR) was used to analyze the probabilistic relationship between groundwater wells and groundwater influence factors in this study. FR is defined as the ratio between areas in which groundwater wells occur to the total study area, and is calculated using the following equation [18]:

$$\mathrm{FR} = \frac{a/b}{c/d} \tag{10}$$

where $a$ is the number of pixels with groundwater wells for each class of a groundwater influence factor, $b$ is the total number of groundwater wells in the study area, $c$ is the number of pixels in the factor's class, and $d$ is the total number of pixels in the study area. FR is considered to show an average relationship if its value is 1, high correlation if larger than 1, and low correlation if lower than 1 [60]. Among groundwater influence factors, FR values for continuous factors (e.g., altitude) were calculated after dividing the values into nine interclasses by quantile classification.
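The FR definition can be illustrated with a small sketch that divides the share of wells falling in a class by the share of the study area occupied by that class. The function name and the well and pixel counts for the example class are hypothetical; the total of 6,250,000 pixels follows from the 625 km² study area at a 10 × 10 m resolution.

```python
def frequency_ratio(wells_in_class, wells_total, pixels_in_class, pixels_total):
    """FR = (a / b) / (c / d): well share of a class over its area share."""
    well_share = wells_in_class / wells_total
    area_share = pixels_in_class / pixels_total
    return well_share / area_share

# Hypothetical class: 120 of 859 training wells fall in a class that
# covers 500,000 of the 6,250,000 pixels (625 km2 at 10 m resolution).
fr_example = frequency_ratio(120, 859, 500_000, 6_250_000)
```

A class holding 8% of the area but about 14% of the wells yields an FR above 1, which is read as a higher-than-average association with well occurrence.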
The results of the FR analysis are presented in Table 3. The value of FR for altitude was larger than 1 in classes 7-59. The value of FR was larger than 1 when the slope was 10.76 degrees or below, and was highest, at 2.92, in the 2.96-6.72 class. Regarding the slope aspect, the values of FR were relatively high for flat, northeast-facing, southeast-facing, and southwest-facing slopes. The value of FR for plan curvature and profile curvature was highest in the flat class and lowest in the convex class. Regarding the TWI, the value of FR was high at 2.42 and 2.09 in the −5.20 to 0.57 and 3.45 to 5.01 classes, respectively, and low at 0.19 in the −7.84 to −5.2 class. SPI had an FR value larger than 1 when the value of SPI was lower than −1.60, except in the −8.60 to −6.26 class. The FR value for STI was high at 1.16 and 1.14 in the 0 and 1-14.80 classes, respectively. Regarding the distance from drainage, the FR value was highest at 1.17 in the 917.47-1190.76 and 1483.56-1834.94 classes and lowest at 0.83 in the 0-195.206 class. For drainage density, the highest FR value was found in the 0.18-0.73 class. For lithology, FR was larger than 1 in the igneous rock and dike and talus classes, indicating a high probability of well occurrence. In the case of distance from a fault, the value of FR was highest at 2.07 in the 2997.87-3997.16 class and lowest at 0.41 in the 0.00-777.23 class. The value of FR for fault density was highest in the 0 class. For distance from a lineament, the highest FR value (1.29) was found in the 443.97-624.85 class and the lowest FR value (0.58) in the 0.00-131.55 class. For lineament density, the highest FR value was found in the 0.46-0.60 class (1.31) and the lowest in the >1.09 class (0.58). For land cover, the values of FR were high at 2.48, 1.95, and 1.79 in the urban, farmland, and bare land classes, respectively, indicating a high probability of well occurrence.
Based on the results of FR analysis, the classes of groundwater influence factors used in this study had different FR values, and some factors showed a broad range of values. For example, the slope degree had an FR range of 0.02-2.92. In addition, each groundwater influence factor had at least one class with an FR value larger than 1, showing a high correlation with well occurrence. Therefore, the 16 factors used in this study are appropriate for use as groundwater influence factors.

Application of the Logistic Regression Model
The results of analysis using the LR model are presented in Table 4. Among the independent variables used in this study, factors such as the slope degree, profile curvature 2 (flat), TWI, SPI, distance from drainage, lithology 3 (sedimentary rock B), lithology 4 (igneous rock), fault density, distance from lineament, land cover 3 (forest), land cover 5 (wetland), and land cover 7 (water) had significant effects on groundwater well occurrence at the 5% significance level. Based on the β coefficients, altitude, SPI, distance from drainage, and lineament density had positive effects on groundwater well occurrence. However, the slope degree, TWI, STI, drainage density, distance from fault, and distance from lineament had negative β coefficient values, indicating negative effects on groundwater well occurrence. For categorical variables such as the slope aspect, plan curvature, profile curvature, lithology, and land cover, the plan curvature 3 (concave), profile curvature 2 (flat), lithology 2 (sedimentary rock A), lithology 4 (igneous rock), lithology 5 (dike and talus), and land cover 6 (bare land) classes had positive effects on groundwater well occurrence. On the other hand, some classes of slope aspect, plan curvature, profile curvature, lithology, and land cover had negative effects on groundwater well occurrence. In addition, the results of the LR model showed a null deviance of 2381.7 with 1717 degrees of freedom. The residual deviance was 1722.9 with 1684 degrees of freedom and an Akaike Information Criterion of 1790.9.
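The reported deviance figures can be cross-checked with simple arithmetic. The sketch below assumes an ungrouped binary response, for which the residual deviance equals −2 times the log-likelihood, so that AIC equals the residual deviance plus twice the number of estimated parameters; the variable names are illustrative.

```python
# Training observations: 859 wells plus 859 non-well locations.
n_obs = 859 + 859

# The null model estimates only the intercept.
null_df = n_obs - 1

# Reported residual degrees of freedom and residual deviance.
residual_df = 1684
residual_deviance = 1722.9

# Number of estimated parameters (intercept plus coefficients).
n_params = n_obs - residual_df

# AIC for an ungrouped binary model: deviance + 2 * parameters.
aic = residual_deviance + 2 * n_params
```

Under these assumptions the sample size implies 1717 null degrees of freedom and 34 estimated parameters, and the resulting AIC agrees with the reported value of 1790.9.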

Application of Multivariate Adaptive Regression Splines Model
The optimal MARS model included 25 terms, and the GCV was 0.165. The MARS model generates the optimal model by only selecting the necessary independent variables [55]. Of the 16 independent variables included in this study, only 10 variables (altitude, slope degree, distance from drainage, drainage density, lithology, distance from fault, fault density, distance from lineament, lineament density, and land cover) were used to construct the optimal model. For categorical variables, such as lithology and land cover, only some classes were included, for example, sedimentary rock A, sedimentary rock B, igneous rock, forest, wetland, and water. Based on an analysis of the MARS model, a BF was created for each independent variable and each BF had a different β coefficient. Continuous variables have one or more constants corresponding to a knot within the variable, which lead to different effects on groundwater well occurrence.
In the MARS model, it is possible to estimate the relative importance of variables. The results of the selections and contributions of various independent variables are shown in Table 5. Here, nsubset is a criterion of the number of model subsets that include each variable. Variables that are included in more subsets are considered more important. GCV provides a generalized cross-validation of the model. The GCV criterion first calculates the decrease in the GCV for each subset relative to the previous subset. Then, for each variable, it sums these decreases over all subsets that include that variable. Finally, for ease of interpretation, the summed decreases are scaled so the largest summed decrease is 100. In addition, RSS is the residual sum-of-squares of the model. In the case of RSS and GCV, variables that cause larger net decreases are considered more important [58,61].
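The GCV-based importance criterion described above (sum the GCV decreases over the subsets that contain a variable, then scale so the largest sum is 100) can be sketched as follows. The function name, the subset GCV values, and the variable names are hypothetical and serve only to illustrate the bookkeeping; this is not the "earth" implementation.

```python
def gcv_importance(subset_gcvs, subset_vars):
    """Scaled importance from GCV decreases across nested model subsets.

    subset_gcvs[i] is the GCV of subset i (subset 0 is the intercept-only
    model) and subset_vars[i] is the set of variables used by subset i.
    """
    totals = {}
    steps = zip(subset_gcvs, subset_gcvs[1:], subset_vars[1:])
    for previous, current, variables in steps:
        decrease = previous - current          # drop relative to previous subset
        for name in variables:
            totals[name] = totals.get(name, 0.0) + decrease
    largest = max(totals.values())
    return {name: 100.0 * total / largest for name, total in totals.items()}

# Hypothetical sequence of subsets of increasing size.
gcvs = [0.25, 0.21, 0.19, 0.18]
variables = [set(),
             {"forest"},
             {"forest", "altitude"},
             {"forest", "altitude", "slope"}]
importance = gcv_importance(gcvs, variables)
```

In this toy sequence the variable present in every subset accumulates all of the decreases and is scaled to 100, while variables that enter later receive proportionally smaller scores. The nsubset criterion would simply count how many subsets contain each variable.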
Based on Table 5, land cover 3 (forest) was the most important variable explaining the spatial distribution of groundwater wells in the study area, followed by altitude, slope degree, land cover 7 (water), and distance from the fault. These independent variables had lower values (below 0.15) for the frequency ratio compared with other variables. In addition, altitude had no significant effect on groundwater well occurrence at the 5% significance level in the LR model. The influence of the independent variables differed depending on the result of FR, LR, and the MARS model. The fitted LR and MARS model equations were applied to the overall study area to create the GPMs.
When the GPM was classified by the groundwater potential zone using four classification techniques including a natural break, quantile, equal interval, and geometrical interval, and when the distribution of training and validation groundwater wells in high and very high zones was comparatively analyzed, the quantile classification technique was most accurate [15]. Based on this finding, the GPMs created using the LR and MARS models were classified into very low, low, moderate, high, and very high groundwater potential zones using the quantile classification technique. Figure 4 shows the GPMs created by the LR and MARS models.
The surface area of the GPM created using the LR model in high and very high zones was 248 km², which is 39.7% of the overall surface area. In addition, the surface area of the GPM created with the MARS model in high and very high zones was 251 km² (40.1%), which is slightly larger than that of the GPM created using the LR model. Comparing the surface areas of the GPM in each zone, the difference in the surface area of the very high zone (0.62 km²) was not large. However, the surface area of the GPM created with the MARS model in the high zone was 1.78 km² larger than that in the GPM created using the LR model, a relatively large difference in surface area.
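The quantile classification into five potential zones can be sketched as below. This is a simplified illustration with hypothetical cell values; the function names are assumptions, and GIS software may place class breaks slightly differently (for example, by interpolating between values or handling ties in another way).

```python
def quantile_breaks(values, n_classes=5):
    """Upper bounds of the first n_classes - 1 quantile classes."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_classes - 1] for k in range(1, n_classes)]

def classify(value, breaks, labels):
    """Return the label of the first class whose upper bound covers value."""
    for upper, label in zip(breaks, labels):
        if upper >= value:
            return label
    return labels[-1]

zone_labels = ["very low", "low", "moderate", "high", "very high"]

# Hypothetical groundwater potential values for 100 raster cells.
potentials = [i / 100 for i in range(1, 101)]
breaks = quantile_breaks(potentials)
zones = [classify(v, breaks, zone_labels) for v in potentials]
```

With quantile breaks each zone receives roughly the same number of cells, which matches the reported result that the high and very high zones together cover about 40% of the study area for both models.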

Validation and Comparison
ROC curves were used to evaluate the performance of the GPMs created in this study. An ROC curve is a technique used to describe the efficiency of probabilistic and deterministic detection and prediction systems [62], and is formed by plotting the trade-off between the false positive rate (1-specificity) on the X-axis and the true positive rate (sensitivity) on the Y-axis. ROC curves can be divided into success rate curves and prediction rate curves according to the dataset used. A success rate curve is formed using a training dataset, and represents how well the model fits the observed groundwater wells. A prediction rate curve is formed using a validation dataset, and represents how well the model predicts groundwater wells [63]. The ROC curve can be used as a quantitative measure through the calculation of the area under the curve (AUC). The AUC typically ranges from 0.5 to 1.0, and a value closer to 1 indicates a model with a better predictive capability. AUC values can be evaluated as follows: poor, 0.5-0.6; average, 0.6-0.7; good, 0.7-0.8; very good, 0.8-0.9; and excellent, 0.9-1.0 [64]. The ROC curves and AUCs of the GPMs created using the LR and MARS models are shown in Figure 5.
For the success rate curves, the AUC values of the GPMs created using the LR and MARS models were 0.838 and 0.867, respectively. Thus, the AUC value was 0.029 higher for the GPM created using the MARS model than for the GPM created with the LR model. The prediction rate curves showed a similar pattern, with the AUC (0.836) of the GPM created with the MARS model being slightly higher than the AUC (0.801) of the GPM created with the LR model. These results, showing that the GPMs created in this study have AUC values of 0.8 or above, indicate a very good predictive capability for both models according to the scale above, with the MARS model performing better than the LR model.
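How an ROC curve and its AUC are obtained from a potential map can be sketched as follows. This is a minimal illustration, assuming each sampled location has a predicted potential score and a binary label (1 for a well location, 0 for a non-well location). The scores, labels, and function names are hypothetical and do not reproduce the data or software used in this study.

```python
def roc_points(scores, labels):
    """Return the (false positive rate, true positive rate) points
    obtained by lowering the decision threshold through the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every location tied at this score before adding a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores and labels for eight sampled locations.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

curve = roc_points(scores, labels)
print(f"AUC = {auc(curve):.3f}")
```

Running the same functions on the training locations would give a success rate AUC, and running them on the held-out validation locations would give a prediction rate AUC.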

Discussion and Conclusions
Groundwater is an important natural resource, and the spatial distribution of groundwater can be detected and predicted by creating GPMs using various factors to ensure its continued availability. In this study, the LR and MARS models were used to evaluate groundwater potential and create GPMs. Groundwater well data (with high potential yields of ≥70 m³/d) were divided into a training dataset (70%, 859 groundwater well locations) and a validation dataset (30%, 365 groundwater well locations). This study used 16 groundwater influence factors for groundwater potential assessment, including topographic factors (altitude, slope degree, slope aspect, plan curvature, profile curvature), hydrologic factors (TWI, SPI, STI, distance from drainage, drainage density), geological factors (lithology, distance from fault, fault density, distance from lineament, lineament density), and land cover. Groundwater well locations and groundwater influence factors were applied to the LR and MARS models to analyze groundwater potential, and GPMs were created based on the results of this analysis. The accuracy of the models was tested using ROC curves. The GPMs created in this study all exhibited AUC values of 0.8 or above, indicating very good model performance. The AUCs for the success rate and prediction rate curves of the GPM created with the MARS model were, respectively, 0.029 and 0.035 higher than those of the GPM created using the LR model.
It was also possible to estimate the groundwater potential in the MARS model. The results showed that land cover, altitude, slope degree, and distance from a fault made large contributions to groundwater occurrence. According to other studies, altitude, TWI, distance from rivers, land cover, fault density, slope, and lithology were important factors [9,20,39]. These findings are similar to our results. However, in our study, lithology, distance from rivers, and fault density were less important factors. This could be due to the conditions of the study area and the method used.
Based on the results, the MARS model is more robust and has a better predictive capability than the LR model for the evaluation and mapping of groundwater potential in this study area. The MARS model has the following advantages compared to traditional regression-based analysis. The MARS model creates its final results by selecting only the important variables from among the multiple input variables [65]. This can effectively reduce the time researchers need to select the groundwater influence factors during GPM analysis. Even if unnecessary variables are included during this selection process, the optimal variables can still be selected by the MARS model. In addition, the MARS model is an easy-to-interpret model that can handle complex data in a computationally efficient manner for multivariate problems involving large volumes of data [57].
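The interpretability and variable selection noted above follow from the structure of a MARS model, which is a weighted sum of hinge basis functions of the form max(0, x - c) or max(0, c - x). The sketch below illustrates that structure only. The intercept, coefficients, knots, and variable names are hypothetical and are not the fitted model from this study.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) when direction is +1,
    max(0, knot - x) when direction is -1."""
    return max(0.0, direction * (x - knot))


# Hypothetical retained terms: (coefficient, variable, knot, direction).
# A variable with no term here has been dropped from the model.
TERMS = [
    (-0.04, "slope_deg", 10.0, +1),
    (0.02, "slope_deg", 10.0, -1),
    (-0.003, "altitude_m", 200.0, +1),
]
INTERCEPT = 0.6


def mars_predict(cell):
    """Evaluate the additive model for one cell of factor values."""
    return INTERCEPT + sum(c * hinge(cell[v], k, d) for c, v, k, d in TERMS)


def variables_used():
    """List the variables that still appear in at least one term."""
    return sorted({v for _, v, _, _ in TERMS})


# Hypothetical cell; lithology_code is supplied but unused by the model.
cell = {"slope_deg": 15.0, "altitude_m": 100.0, "lithology_code": 3}
print(f"prediction = {mars_predict(cell):.2f}")
print("variables used:", variables_used())
```

Each term can be read directly: in this illustration, the predicted value falls by 0.04 for every degree of slope above the 10-degree knot, and a factor that contributes no term has no effect on the prediction.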
The GPMs created in this study suggest the possibility of groundwater occurrence in the study area. Through a comprehensive understanding of groundwater potential, GPMs can be used to drive the exploration of groundwater resources effectively and economically, and to prevent undesirable effects of water resource development. Therefore, GPMs can be useful for decision-makers and planners in devising plans for sustainable water resource management, ecologically friendly land use, and environmental preservation of the study area. However, it is still necessary to apply the MARS model in more diverse areas and to compare it with other models to make a reliable judgment of its efficacy. Additional detailed spatial data reflecting geological and hydrogeological conditions should also be used to analyze groundwater potential in the future.

Figure 1. Location of the study area: (a) administrative map showing one town and 15 townships; (b) groundwater well locations divided into training and validation datasets.

Figure 2. Flow chart of the methodology used in this study.

Figure 4. Groundwater potential maps produced by the (a) LR and (b) MARS model.

Figure 5. Results of model validation for each GPM: (a) success rate and (b) prediction rate curves.

Table 1. Data sources used in this study.

Table 2. Multicollinearity diagnostic indices for independent variables.

Table 3. Spatial relationships between groundwater wells and groundwater influencing factors determined using the frequency ratio model.

Table 4. β coefficients of groundwater influence factors used in the logistic regression model.

Table 5. The contributions of various independent variables in the MARS model.