Sinkhole Detection and Characterization Using LiDAR-Derived DEM with Logistic Regression

: Depressions due to sinkhole formation cause signiﬁcant structural damages to buildings and civil infrastructure. Traditionally, visual inspection has been used to detect sinkholes, which is a subjective way and time- and labor-consuming. Remote sensing techniques have been introduced for morphometric studies of karst landscapes. This study presents a methodology for the probabilistic detection of sinkholes using LiDAR-derived digital elevation model (DEM) data. The proposed study provides beneﬁts associated with: (1) Detection of unreported sinkholes in rural and / or inaccessible areas, (2) automatic delineation of sinkhole boundaries, and (3) quantiﬁcation of the geometric characteristics of those identiﬁed sinkholes. Among sixteen morphometric parameters, nine parameters were chosen for logistic regression, which was then employed to compute the probability of sinkhole detection; a cuto ﬀ value was back-calculated such that the sinkhole susceptibility map well predicted the reported sinkhole boundaries. According to the results of the LR model, the optimal cuto ﬀ value was calculated to be 0.13, and the area under the curve (AUC) of the receiver operating characteristic curve (ROC) was 0.90, indicating the model is reliable for the study area. For those identiﬁed sinkholes, the geometric characteristics (e.g., depth, length, area, and volume) were computed.


Introduction
Sinkholes can cause serious damage to properties and infrastructure, and sometimes human casualties occur in severe cases. In the United States, economic damage due to sinkholes has been estimated to be over $300 million per year and the actual damage is likely to be much greater than this estimate [1]. Considerable damage from natural sinkholes is particularly common in Florida, Texas, Alabama, Missouri, Kentucky, Tennessee, and Pennsylvania [2]. Insurers in Florida, the most vulnerable state to sinkhole damage, received a total of 24,671 claims for sinkhole damage between 2006 and 2010, totaling $1.4 billion [3]. According to the Florida Office of Insurance Regulation (FLOIR) report, the insurers' expense has been gradually growing with increases in both frequency and severity of sinkholes.
Sinkholes have been classified into two main groups [4,5]. The first group, known as solution sinkholes, involves centripetal flow to areas having the highest permeability and consequent dissolution [6][7][8]. The second group is known as subsidence sinkholes and involves downward movement of overlying soils into cavities within bedrock. Subsidence sinkholes are further classified using two descriptors: The material affected by internal erosion or deformation (cover, bedrock, and caprock) and the main subsidence mechanism (collapse, suffosion, or sagging). Details of sinkhole classification can be found in previous research by Gutiérrez et al. [5].

Study Area
The study area is located in the Springfield Plateau region, the southwestern part of the state of Missouri, and more specifically in Greene County ( Figure 1). It is located between latitudes 37 • 18 00 N and 37 • 19 40 N and longitudes 93 • 20 30 W and 93 • 22 30 W, with an area of 9 km 2 . The estimated terrain elevation above sea level ranges between 327 and 381 m. The region is underlain by thick and well karstified carbonate rocks that develop a variety of karst features, including sinkholes, caves, and springs. The process of sinkhole formation and collapse in this region is due to dissolution and cavity growth in the underlying bedrock as groundwater percolates through voids and cracks in the rock. According to the Geological Survey Program of the Missouri Department of Natural Resources' Missouri Geological Survey, 15,763 sinkholes have been reported in the state until December 2018 and numerous non-reported sinkholes also exist in the region [30]. In the study area, a total of 199 sinkholes of varying sizes have been identified.

Methodology
The flow chart showing the research methodology is presented in Figure 2. This study mainly consists of the following four steps: (1) Data preparation; (2) sinkhole susceptibility modeling using the generalized linear model (GLM) for logistic regression (LR); (3) sinkhole susceptibility mapping to detect sinkhole boundaries; and (4) sinkhole geometric characterization. In the data preparation step, a digital elevation model (DEM) and geomorphometric indices were created. Logistic regression was then selected for the sinkhole susceptibility model. A sinkhole susceptibility map was created through the cutoff value identified. Once the boundaries of sinkholes were identified, geometric characteristics such as length, area, volume, and circularity were determined. Details of each step are presented below.

Data Preparation
A DEM is a regularly spaced grid of terrain elevation and is created from LiDAR datasets using GIS software. This high-resolution DEM is a fundamental element of sinkhole inventories, and the Missouri Department of Natural Resources (MDNR) actively uses a LiDAR-derived DEM to identify sinkhole boundaries. For this study, a DEM at a resolution of 1 m was acquired from the Missouri LiDAR DEM Download Tool of the Missouri Spatial Data Information Service [31]. The airborne LiDAR dataset for the study area was collected in 2007 using the Leica ALS-60 LiDAR system. The estimate of accuracy of LiDAR data is documented as 15 cm for vertical and 50 cm for horizontal accuracy [32]. While the resolution of the constructed DEM is different from the LiDAR point cloud data (which is stored in an LAS file format), it is the best available resolution and high enough to detect and map sinkholes.
Commonly used geomorphometric indices (or parameters) were derived using SAGA GIS software (http://saga-gis.org) and then included as independent variables in the GLM model building process [33]. The LiDAR-based DEM was used to extract the remaining indices. The following 16 indices were derived from the DEM: Slope, aspect, plan curvature, profile curvature, closed depression, slope height, valley depth, normalized height, convergence index (search radius of 50 m), convergence index (search radius of 100 m), mid-slope position, multiresolution index of valley bottom flatness (MRVBF), multiresolution ridge top flatness index (MRRTF), mass balance index (MBI), topographic position index (TPI), and topographic wetness index (TWI). These indices are defined as:

1.
Slope: Slope is one of the most important factors in hydrology because it is related to surface and subsurface flow velocity and runoff rate over the area of interest [34,35]. As slope increases, time for surface infiltration decreases, resulting in an increase in soil erosion. In this study area, the slope ranges from 0 to 81.6 • (0 to 680% in percent rise). Steep slopes are located within the southwest area along the Little Sac River.

2.
Aspect: Aspect (or slope aspect) refers to the primary direction of change of a DEM and is expressed in degrees in a clockwise direction from north (ArcGIS 2010). The slope aspect affects exposure to rainfall, wind, and vegetation cover of the area [34,35] Profile curvature (PRC): PRC is the rate of change in gradient [34,36,37].

5.
Closed depression (CD): The boundary of a depression is defined as the spatial extent of maximum water surface level when the depression is filled with flood water and then starts spilling out. Therefore, a closed depression can be a significant indicator of a sinkhole boundary in karst landscapes. 6.
Slope height (SH): SH is defined as the relative height above the closest modeled drainage accumulation [38,39]. It ranged from 0 to 29 m for the study area. 7.
Valley depth (VD): VD is the vertical height below summit accumulation [38]. It ranged from 0 to 27 m for the study area. 8.
Normalized height (NH): NH is the normalized difference between SH and the VD and is unitless [38,39]. 9.
Convergence index (CI50, 50-m search radius): CI50 is used to determine whether water flow from neighboring cells diverges or converges. Convergence is calculated using flow direction between adjacent cells based on the aspects of neighboring cells [40]. 10. Convergence index (CI100, 100-m search radius): The same CI but the search radius of 100 m was used. 11. Mid-slope position (MSP): MSP has values ranging from 0 (minimum slope) to 1 (maximum vertical distance from valley bottom or ridge top, i.e., valleys and crests) [41].

Multiresolution index of valley bottom flatness (MRVBF)
: MRVBF is a measure of flatness and lowness depicting depositional areas [42]. Higher values correspond to larger valleys. It ranged from 0 to 4.99 for the study area. 13. Multiresolution ridge top flatness index (MRRTF): MRRTF is a measure of flatness and elevation, depicting stable upland areas [42]. Higher values correspond to ridges. It ranged from 0 to 5.67 for the study area. 14. Mass balance index (MBI): MBI is derived from transformed elevation, slope, and mean curvature.
Positive values indicate convex forms (upper slopes, crests) whereas negative values indicate concave forms (valleys and lower slopes) [43]. 15. Topographic position index (TPI): TPI is the difference between a raster cell elevation and the average elevation of neighboring cells. TPI is calculated as: where Z o is the elevation at a central point and Z is the mean surrounding elevation. Positive TPI values denote that the cell is located higher than its average neighborhood, whereas negative values denote that the cell is in lower position [35,44]. 16. Topographic wetness index (TWI): TWI combines the local upslope contributing area and slope.
TWI is calculated as: where α is the upslope catchment area per unit contour length and β is the local slope gradient in percentage. High values indicate drainage depressions whereas low values indicate crest and ridges [45]. For more detailed information, you can find them at Supplementary Materials. Figure 3 shows the map of 9 geomorphometric indices used for the final model development. While 16 indices were considered as significant contributing indices to the sinkhole susceptibility map, 7 indices were excluded by eliminating insignificant variables and/or highly correlated variables. Details of such a process will be discussed in Section 4.1. All maps were produced with a spatial resolution of 1 m.

GLM Model Selection-Logistic Regression
Generalized linear modeling (GLM) was employed to determine the existence of sinkholes, which was necessary for generating the probabilistic sinkhole susceptibility map. Generalized linear models are extensions of linear regression models and are used to handle dependent variables with non-normal distributions [46,47]. Logistic regression (LR) was selected as the GLM, and those geomorphometric indices were used as input variables. Logistic regression is a member of the family of GLMs and can be used to regress a dichotomous (or binary) dependent variable on a series of independent variables [48].    In this study, the binary dependent variable represents presence (1) or absence (0) of a sinkhole. It was transformed into a logit variable, and then the maximum likelihood estimation was applied to predict the model parameters. The LR model estimated the odds of an event occurring, and also assessed the relative importance of each individual variable within the fitted model. The probabilistic relationship between sinkhole occurrence and its dependency on geomorphometric variables was computed from the following: where P(S) is the probability of an event occurring. In the present study, the value refers to the estimated spatial probability of sinkhole occurrence. P(S) varies from 0 to 1 on an S-shaped curve, where 0 indicates 0% probability of a sinkhole and 1 indicates 100% probability. The term Z is the linear combination of independent variables, which varies from −∞ to +∞, and can be defined as: In Equation (5), b 0 is the intercept of the model, x i (i = 0, 1, 2, . . . , n) are the independent variables, n is the number of independent variables, and b i (i = 0, 1, 2, . . . , n) are the coefficients of the LR model Remote Sens. 2019, 11, 1592 7 of 16 associated with each of the independent variables. The relationship between P(S) and Z can also be expressed as: where P(S)/(1 − P(S)) is the odds ratio.
To evaluate the relative importance of independent variables on sinkhole development, the best subsets logistic regression method was used. First, GLMs for all possible combinations of independent variables were fitted separately, and then fitted models were ranked based on goodness-of-fit criteria [49]. Both Akaike information criteria (AIC) and Schwarz-Bayesian information criteria (BIC) were computed as a measure of goodness-of-fit, as follows: where L is the maximum value of the likelihood function for the model, k is the number of estimated parameters in the model, and n is the number of observations. Given a set of candidate models for the data, the model with the lowest AIC and BIC values is preferred [50,51]. It is noted that the terms, 2 and ln(n) in Equations (7) and (8), respectively, penalize the models with large numbers of independent variables to avoid overfitting.

Variable Selection
It is necessary to include only significant independent variables in the LR model; thus, the Wald statistic was used to check the contribution of independent variables in the model. A p-value of 0.05 or less (significant level of 95%) was used to identify significance. Therefore, any non-significant variables (having a p-value greater than 0.05) were excluded from the model.
The LR model is generally sensitive to collinearity (i.e., substantial interactions) between independent variables. High multicollinearity between independent variables leads to high standard errors of the regression estimates, and, consequently, interpretation of relative importance of the independent variable would be unreliable. The variance inflation factor (VIF) is commonly used to check the degree of multicollinearity. The model is regarded to be free from the multicollinearity problem if the value is less than 10 [52,53]. Any variable with a VIF or greater than 10 was excluded from the analysis. A Spearman correlation test was also conducted to examine the association between each of the two independent variables. If the correlation coefficients were high (Spearman's rho (ρ) value of 0.6 or greater), one of the variables was excluded from the LR model [54].

Variable Selection
The LR analysis started with 17 independent variables, including DEM and 16 geomorphometric indices derived from the DEM. In this study, the variables were not standardized to assist in interpreting odds ratios, since standardization does not affect LR. The regression results demonstrated that all independent variables were significant (p < 0.05). However, it was necessary to check the correlation among variables. Both Spearman correlation and VIF were checked. While the VIF result indicated no significant problem of multicollinearity of the variables of interest (VIF < 10), the Spearman correlation analysis revealed that seven variables were highly correlated. These seven variables were slope, depression, normalized height, CI50, CI100, MRRTF, and TPI. These variables did not meet the Spearman correlation criteria of ρ < 0.6. A summary of Spearman correlation coefficient analysis is shown in Figure 4. demonstrated that all independent variables were significant (p < 0.05). However, it was necessary to check the correlation among variables. Both Spearman correlation and VIF were checked. While the VIF result indicated no significant problem of multicollinearity of the variables of interest (VIF < 10), the Spearman correlation analysis revealed that seven variables were highly correlated. These seven variables were slope, depression, normalized height, CI50, CI100, MRRTF, and TPI. These variables did not meet the Spearman correlation criteria of ρ < 0.6. A summary of Spearman correlation coefficient analysis is shown in Figure 4. In the next step of LR analysis, Model 1 was built using the remaining ten variables and the significance test and correlation analysis were also undertaken. For Model 1, all ten variables were significant at the p < 0.05 level. According to both VIF and Spearman correlation tests, no independent variables were in violation of multicollinearity. However, to build up a simpler but still robust model with fewer independent variables, stepwise LR with a backward selection was used. In this step, the model was adjusted by sequentially eliminating the variable with the smallest relative importance of the variables. Summary results of LR analysis for the three candidate models are presented in Table 1. The relative importance of independent variables was estimated by the absolute value of Wald statistic (z value), which is the regression coefficients divided by their standard errors [55]. Higher Wald statistic indicates greater importance of the corresponding variable. Thus, the largest relative importance of Model 1 was determined to be TWI, followed by DEM, SH, MRVBF, MBI, PLC, VD, PRC, aspect, and MSP, in order. In the Model 2 and 3, the least important variable was removed from the previous model, and the relative importance was estimated as in the Model 1. As shown in the Table 1, the order of relative importance did not vary from Model 1 to Model 3, except for the order between SH and MRVBF in Model 2 with insignificant difference in the importance. Based on the p value of each variable, all variables have p value less than 0.05, meaning there is a statistically significant relationship with sinkhole events. The In the next step of LR analysis, Model 1 was built using the remaining ten variables and the significance test and correlation analysis were also undertaken. For Model 1, all ten variables were significant at the p < 0.05 level. According to both VIF and Spearman correlation tests, no independent variables were in violation of multicollinearity. However, to build up a simpler but still robust model with fewer independent variables, stepwise LR with a backward selection was used. In this step, the model was adjusted by sequentially eliminating the variable with the smallest relative importance of the variables. Summary results of LR analysis for the three candidate models are presented in Table 1. The relative importance of independent variables was estimated by the absolute value of Wald statistic (z value), which is the regression coefficients divided by their standard errors [55]. Higher Wald statistic indicates greater importance of the corresponding variable. Thus, the largest relative importance of Model 1 was determined to be TWI, followed by DEM, SH, MRVBF, MBI, PLC, VD, PRC, aspect, and MSP, in order. In the Model 2 and 3, the least important variable was removed from the previous model, and the relative importance was estimated as in the Model 1. As shown in the Table 1, the order of relative importance did not vary from Model 1 to Model 3, except for the order between SH and MRVBF in Model 2 with insignificant difference in the importance. Based on the p value of each variable, all variables have p value less than 0.05, meaning there is a statistically significant relationship with sinkhole events. The independent variables of all models did not correlate with each other (refer to Figure 4) and the VIFs were below 2.06, indicating a low risk of multicollinearity.
In the next step, measures of goodness-of-fit, including AIC, BIC, and pseudo R 2 , were calculated to select the optimal model. Table 2 summarizes the goodness-of-fit statistics for three candidate models. The model with the lowest AIC and BIC values and the highest pseudo R 2 was considered the best-fit model. Models 2 and 3 had the same pseudo R 2 values of 0.692, but Model 2 had lower AIC and BIC values. In addition, Model 1 had a slightly lower AIC than Model 2, but it also had slightly higher BIC and lower pseudo R 2 values than Model 2. Therefore, Model 2 was selected as the optimal model for the study area.  From the foregoing analyses, nine of seventeen variables were found to play an important role in sinkhole identification and were selected to develop the LR model. These nine variables were: DEM, aspect, plan curvature, profile curvature, slope height, valley depth, MRVBF, MBI, and TWI. The where Z is a linear combination of independent variables. A positive coefficient tends to increase the probability of occurrence while a negative one implies the opposite outcome. Five independent variables, DEM, plan curvature, valley depth, MBI and TWI, exhibited a positive influence on sinkhole occurrence, while aspect, profile curvature, slope height, and MRVBF exerted a negative influence. Figure 5 shows the sinkhole susceptibility map created based on Equation (9); the computed probability of sinkhole occurrence varies from 0 to 1.

Cutoff Value
The resulting sinkhole susceptibility map ( Figure 5) is a raster format in which each 1-m resolution grid cell represents the probability of sinkhole occurrence. In order to determine those high probability areas as sinkholes, a cutoff value was required. This cutoff threshold was determined based on the sensitivity and specificity tests. Sensitivity and specificity are statistical measures of the performance of a binary classification model [56]. The sensitivity (also known as a true positive rate) measures the proportion of actual positives that are correctly identified as such; thus, in this study, it tells how well the model detects sinkholes. The specificity (also known as a true negative rate) measures the proportion of negatives identified as such, and it tells how accurately the model avoids false sinkhole detections. The sensitivity and specificity were calculated as follows: Speci f icity = TN TN + FP (11) where TP is the number of true positives, FN is the number of false negatives, TN is the number of true negatives, and FP is the number of false positives. The setting of sensitivity and specificity is always a trade-off. If the cutoff value is set too low, the number of false negatives decreases, resulting in increased sensitivity. However, the number of false positives increases, and this results in decreased specificity. The effects changes in cutoff values on sinkhole identification are presented in Figure 6. The cutoff value varied from 0.1 to 0.8, and the sinkholes detected by the model were compared with the validated sinkhole database. In Figure 6a, the cutoff value was set low at 0.1 and the result shows a larger proportion of false positive results, leading to a decrease in true negative rate. In contrast, when the cutoff value was set high at 0.8 (Figure 6d), the number of true positives decreased. Thus, the optimum cutoff value of the model can be determined by maximizing the sum of sensitivity and specificity. For this study, the optimal cutoff value was determined as 0.13, which corresponds to the cross-point of sensitivity and specificity curves (Figure 7a). Figure 7b shows the area under the receiver operating characteristic (ROC) curve (also called the AUC). It is a plot of sensitivity versus 1-specificity for all possible cutoff classification probability values, showing the performance of a classification model at all possible probability values. In this study, the AUC was 0.90, demonstrating that the fitted LR model was highly reliable in explaining variability in sinkhole occurrence as a function of selected geomorphometric variables.

Sinkhole Susceptibility Map (Cutoff = 0.13)
A new sinkhole map was developed in binary classification format using a cutoff value of 0.13 ( Figure 8a). The map is a binary raster in which a value of 1 was assigned to grid cells with a probability of equal or greater than 0.13 and a value of 0 was assigned to any cell with a probability of less than 0.13. To produce sinkhole boundaries, contour lines were drawn by connecting all outer points of equal probability class. Figure 8b displays the sinkhole boundaries generated by the GLM model based on the selected cutoff value. Red and yellow lines represent the detected and reference sinkhole boundaries, respectively, in Figure 8b. This approach does not necessarily delineate the sinkhole boundaries but is able to locate sinkholes.  As previously discussed, more or wider sinkhole boundaries can be identified with application of a lower cutoff value. However, it also excessively identifies less susceptible areas for sinkholes and incurs unnecessary costs and time loss in sinkhole prevention and management. In contrast, too high a cutoff value can lead to low detection performance and high rates of false negatives. From the hazard management point of view, this scenario tends to be more dangerous. The damage from not knowing that the area has very high sinkhole potential would be more severe than believing falsely that the area has high potential of sinkholes. Therefore, it is recommended that a low cutoff value is used to minimize false negatives, i.e., avoid predicting an area as safe when it is actually dangerous.

Sinkhole Geometric Characterization
After sinkholes were detected, they were quantitatively evaluated regarding their geometric characteristics. Geometric characteristics of sinkholes include length, width, depth, perimeter, area, volume, elongatedness, and circularity. Length and width were defined as the lengths of the major and minor axes of the sinkhole, respectively. Depth is the maximum vertical distance from the lowest point within the sinkhole to the highest point of the sinkhole. Perimeter (or circumference) is the length around the outside of the sinkhole. Elongatedness can be calculated as the ratio of the length to width of the sinkhole. Sinkholes with elongatedness values close to 1 represent more circular shapes, while sinkholes with an elongated shape have higher elongatedness values. Circularity was defined as the ratio of the area of the shape to the area of a circle having the same perimeter and was measured as 4π × area/(perimeter) 2 . This is a measure of the degree of roundness of the sinkhole [57]. Sinkholes with a perfect circle have a value of 1, whereas a value less than 1 indicates an irregular shape. One sinkhole was selected as an example to determine geometric characteristics. Figure 9 shows the boundary of this sinkhole and a detailed 3D profile view of it, and Table 3 contains the summary of geometric characteristics of this sinkhole.

Conclusions
This study presents the sinkhole susceptibility map using LiDAR-derived DEM data. The following conclusions were drawn: (1) The sinkhole detection and characterization techniques were proposed using the LiDAR-derived DEM data. The proposed algorithm is believed to allow for improved consistency and repeatability. (2) Sixteen geomorphometric parameters were derived from DEM data, and a test for multicollinearity was conducted using both the Spearman correlation coefficient and VIF criteria. As a result, seven out of sixteen parameters were found to be highly correlated and were excluded in the further analyses. The selected nine parameters were DEM, aspect, plan curvature, profile curvature, slope height, valley depth, MRVBF, MBI, and TWI.
(3) Logistic regression was used to construct the probabilistic sinkhole susceptibility map using the reported sinkhole inventory. In order to define the appropriate sinkhole boundaries from the probabilistic sinkhole susceptibility map, it is important to determine the optimal cutoff value. In this study, the recommended cutoff value was calculated to be 0.13, which has maximum sensitivity and specificity values at the same time. (4) The proposed sinkhole susceptibility map with the recommended cutoff value well predicted the reported sinkhole boundaries. While the model achieved a considerably high AUC of 0.90, the cutoff value is based on a training dataset of the study area, making the current results limited to the study area, and might not be applicable elsewhere. (5) Geometric features of sinkholes such as length, width, depth, perimeter, area, volume, elongatedness, and circularity can be estimated with the proposed sinkhole susceptibility map and LiDAR data. (6) Significant benefits of this study may include (1) identification of non-inventoried (e.g., newly formed or previously non-detected) sinkholes in the database, (2) automatic delineation of sinkhole boundaries, and (3) quantification of a sinkhole's geometric characteristics.