Land Subsidence Susceptibility Mapping Using Interferometric Synthetic Aperture Radar (InSAR) and Machine Learning Models in a Semiarid Region of Iran

: Most published studies identify groundwater extraction as the leading cause of land subsidence (LS). However, the causes of LS are not only attributable to groundwater extraction. Other land-use practices can also affect the occurrence of LS. In this study, radar interferometric techniques and machine learning (ML) models were used for the prediction, susceptibility zoning, and prioritization of inﬂuential variables in the occurrence of LS in the Bakhtegan basin. The LS rate was characterized by applying an interferometric synthetic aperture radar (InSAR). The recursive feature elimination (RFE) method was used to detect and select the dominant combination of indicators to prepare an LS susceptibility map. Three ML models, including random forest (RF), k-nearest neighbors (KNN), and classiﬁcation and regression trees (CART), were used to develop predictive models. All three models had acceptable performance. Among the ML models, the RF model performed the best (i.e., Nash–Sutcliffe efﬁciency, Kling–Gupta efﬁciency, correlation coefﬁcient, and percent bias metrics of 0.76, 0.78, 0.88, and 0.70 for validating phase, respectively). The analysis conducted on all three ML model outputs showed that high and very high LS susceptibility classes were located on or near irrigated agricultural land. The results indicate that the leading cause of land LS in the study region is not due to groundwater withdrawals. Instead, the distance from dams and the proximity to anticlines, faults, and mines are the most important identiﬁers of LS susceptibility. Additionally, the highest probability of LS susceptibility was found at distances less than 18 km from synclines, 6 to 13 km from anticlines, 23 km from dams, and distances less than 20 to more than 144 km from mines. The validated methods presented in this study are reproducible, transferrable, and recommended for mapping LS susceptibility in semiarid and arid climate zones with similar environmental conditions.


Introduction
Land subsidence (LS) is characterized by vertical downward movements of the earth's surface, usually due to insufficient geological support [1] and often caused by various environmental, hydrogeological, and economic-related impacts and changes [2]. LS can occur due to either natural or anthropogenic causes [3]. LS is most often a cause of concern because it diminishes an aquifer's storage capacity [4], leading to geological cracks, fissures [5], damage to civil infrastructure [6], and increased flood risk [7]. 54 • 00 E and 54 • 22 E longitude ( Figure 1). The climate is semiarid, with an average annual minimum and maximum temperature of 17.6 and 24.1 • C [37]. The long-term  average annual rainfall of the region is 320 mm, and the average yearly pan evaporation varies from 1763 to 2849 mm [34]. The average annual temperature has increased by 1.7 • C (19.5 to 21.2 • C) over the last 40 years (1980-2020; [37]). The basins in this region are among the prohibited basins in terms of groundwater exploitation status [38]. Lithological classes of the Bakhtegan basin are shown in Table S1. Based on lithology, this region (known as Sazand-e Neyriz) is divided into three main groups [39]. The deepest layers include thin dolomite and green shale. The mid-layers include siltstone, sandstone, and green shale silt, and the upper layers contain a thin layer of lime clay and lichen stone. The region is located on a sedimentary rock structure called the Sanandaj-Sirjan zone. This zone is Iran's most active and disruptive structural zone consisting of evaporative sediments and interlayer volcanic rocks [40].

Study Design
For the current investigation, the annual rate of LS (from January 2019 to December 2019), using InSAR data from Sentinel-1 obtained from the Copernicus Open Access Hub and the Sentinel Application Platform (SNAP 7), was obtained. Observed data were also collected from the region to validate the InSAR LS rate estimation ( Figure 2). To develop the LS susceptibility map, ML models were implemented. In order to assess the efficacy of the ML models, a commonly employed approach involves partitioning the available dataset into training and validation subsets, typically using a 70% to 30% data split [41]. For the identification of model inputs, the recursive feature elimination (RFE) model was used [42], and jackknife resampling was used to determine the most influential variables [43]. Additional detailed information is provided in the forthcoming text.
In this research, a digital elevation model (DEM) [ALOS PALSAR 12.5 × 12.5 m (https://vertex.daac.asf.alaska.edu/, accessed on 12 June 2019)] was used to obtain geomorphological variables. The geomorphological variables included elevation [54], slope [54], aspect [54], distances from anticlines and synclines [55], and profile curvature [49]). These LS-determining variables are essential in washing out and transforming saline materials from the soil to the plains of rivers, including transformation and accumulation to lowlands. Lithological units differ relative to LS because of considerable changes in compressible sediment thicknesses [26]. For example, lithology can impact the LS occurrence indirectly by affecting absorption [45]. Similarly, distance from a salt plain (DFSP; [56]) and distance from a saltwater lake (DFSL; [48]) can impact groundwater via interactions between saline and fresh waters, which can indirectly affect LS. Faults that can be the center of tectonic activities, which affect the geological structures and absorption of groundwater, can also create conditions for the accumulation of compressible sediments to impact LS [57]. Hence, the distance from the fault (DFF) is an important variable in LS. The Topographic Wetness Index (TWI) can impact groundwater via flood irrigation, surface water absorption, washing, and dissolution of solid materials in unsaturated regions [58].
We obtained DGWL maps from the Iran Water Resources Management Company data center between 2003 and 2019 ( Figure 3j). Based on the data from 124 observed wells in the Bakhtegan basin, the observed groundwater level drop ranged from 7.8 to 21.9 m (Figure 3j). The TWI (Figure 3o) was developed using SAGA GIS (Ver. 7.1.1) software. DFF (Figure 3s) was measured using Euclidean tools in Arc GIS software [66]. Lithological ( Figure 3e) and fault ( Figure 3s) maps were obtained from the Geological Survey of Iran (Scale = 1:100,000). Figure 3e shows a lithology map of the study area (Table S1)

Interferometric Synthetic Aperture Radar (InSAR)
InSAR databases include surface transformation imaging and time and regional evolution along different seismic cycles for the last decade [67]. Radar interferometry extracts phase differences between two images taken from the radar sensor, the base image, and the following image. The phases of the two images can be compared after the geometric recording of the current and the prior (base) images. Information about the land surface change, elevation, and atmospheric models can be derived using the two images. The phase difference between the two images after geometric recording produces a new kind of image called an interferogram [68,69]. Each interferogram consists of displacement and other signals. Therefore, the transformation could be recognizable if the two SAR images are taken in a one-time range [70]. The radar interferometric process for extracting LS is shown in Figure 4. In order to obtain an interferogram, the phase difference of two single look complex (SLC) images acquired after and before LS is calculated as follows (Govil and Guha, 2023): The interferometric phase (ϕInt) is composed of several phase components, including the phase due to deformation (ϕDef), the atmospheric phase resulting from variations in atmospheric delay at the two acquisition times (ϕAtm), the phase component related to Earth's curvature (ϕCurv), the phase component associated with topography (∆ϕTopo) and the noise term due to decorrelation (ϕNoise) [71]. In order to obtain the contribution of deformation to the interferometric phase, it is necessary to calculate and subtract all of the relevant features from the interferogram, thus producing differential interferograms [72]. After phase unwrapping is performed, the integer phase cycles are calculated for each relative wrapped phase measurement. Finally, the radar coordinate system is geocoded into the global coordinate system. Interferograms are filtered with Goldstein filters to reduce vegetation-induced noise and to enhance signal-to-noise ratios [73].
With a minimum orbit repeat cycle of six days in both descending and ascending orbits, Sentinel-1A captured land over Iran with a resolution of 5 × 20 m (range × azimuth). For the current study, 12 images of Sentinel-1A in pass 64 (low-pass mode) collected from January to December 2019 were obtained. Radar LS values are converted from GPS measurements with some errors based on the satellite orbit accuracy. The longest wavelength component of the InSAR LS rate maps contains an error due to satellite orbital uncertainty [74]. Precise orbit ephemerides provided by Sentinel-1 quality control subsystems are used to reduce the errors associated with satellite orbits [75]. To eliminate the topographic phase's effect and increase the accuracy of the calculations, the digital Shuttle Radar Topography Mission (SRTM) elevation model was used in the current work with a spatial resolution of 30 m for interference mapping. The atmospheric artifact is corrected using external data from the Generic Atmospheric Correction Online Service for InSAR (GACOS) [76]. As part of Sentinel-1A, the ESA provided precise orbit data so that the reference phase could be removed from differential interferograms and orbital errors could be minimized. The LS signal is readily detectable in various interferograms. The highest rate of LS was in 2019 in the southern and southwest parts of the basin and on the lake shore at the rate of 8 to 10 cm per year, respectively ( Figure 5).

Most Determinant LS Susceptibility Variables
Although the indicators in this study were based on the variables used and their relative impact on the LS, additional indicators can affect the study results. Similarly, combining variables may also provide different results [77]. Therefore, an attempt was made to implement a method of selecting variables to identify the most influential variables. The contribution of the influential variables affects the models' ability to predict LS susceptibility accurately; hence, understanding the role of the respective variable in LS susceptibility is a greatly needed area of inquiry. For example, the jackknife (cross-validation) test is a type of valuable analysis that helps to identify the relationship between LS occurrence, vulnerability, and influencing variables. Hence, this study investigated the importance of 27 variables used in modeling LS maps using the Jackknife test [43,78] in RStudio software.

LS Susceptibility Modeling and Validation Map
The effectiveness of ML models is considerably affected by data size and quality because unnecessary or unrelated variables can significantly decrease efficiency [79]. An index selection method can be implemented to identify the most critical indices and improve ML efficiency. For this purpose, the RFE was used to find and choose the most crucial combination of indices to develop LS susceptibility maps. An RFE algorithm is an optimized subset of parameters utilizing a hierarchy that identifies parameters based on the importance of RF indices (i.e., average precision decrement). The algorithm recursively makes multiple RF models. In each recursion, parameters with minor importance (identified based on mean accuracy reduction) are eliminated. Following this step, subsets of parameters with the smallest OOB error are chosen [80]. Random forest (RF) was used for this portion of the investigation. RF has been used previously and is well-accepted for these applications [81,82]. The Card package in R software was used to select the most significant indices based on five-fold cross-validation and metric precision.
The current study developed an LS susceptibility map using three ML models (CART, RF, and KNN). CART is a non-parametric method that uses input data to make prediction relations [83]. CART was used in the current study because it has shown an excellent function with internally non-linear and heterogeneous processes. This method can be used to classify the parameters, which depend on a collection of classes based on the needed level of internal homogeneity, and form a simple model for each set of parameters [84].
Breiman developed the random forest (RF) algorithm in 2001 (co-created by Cutler et al. [85]); as a classification and regression method, it can detect linear and non-linear relations between parameters for classification and regression. This method is used for assessing multiple problem cases, including the generation of a prediction law for a learning problem under observation, evaluation, and ranking of variables considering their result prediction ability. This approach includes various arbitrary decision-making trees. In this research, the RF package of R software was utilized. An RF model explicitly measures the parameter importance using the following two criteria; the average decrease in the Gini Index (GI) and an increasing percentage in RMSE [86]. In this research, we measured the importance of variables by percentage increase using RF because the GI plays a role in calculating variable importance [86]. An essential factor when using the RF model is each branch's minimum number of leaves (notes of the decision trees). No discrete methods exist to determine the number of leaves (nodes). In the past, depending on the problem under inquiry, it has been determined based on trial and error. In this research, the minimum number of leaves is between 5 and 500, and the minimum number is determined based on trial and error.
The K-nearest neighbors (KNN) classification is a model for non-parametric classification that Wu et al. [87] named as one of the ten most prominent algorithms. Because of its simple execution and distinct operation, the method is widely used in data mining and ML observations [88]. KNN can be used to identify complex and non-linear relations among observations. The KNN classification includes the two nearest neighbor parameters (K) and the power (P) [26]. K can be assigned using source data and input variables, while P is based on the weight-distance relation and measures each data point's (K's) degree of similarity. The operation of the KNN models can be affected by many parameters, such as choosing different K amounts and the option of distance measurement [88].
As part of the model development procedure, validation is essential [89]. Validation is used in modeling procedures to demonstrate that the model produces accurate results [90]. Previous authors noted that ML is highly precise [84]. However, uncertainties in model predictions are inevitable, so the validity of predictions and model functionality assessment is critical. In the current study, standard deviation (STDEV), root-mean-square error (RMSE), Nash-Sutcliffe efficiency (NSE), RMSE-observations standard deviation ratio (RSR), correlation coefficient (COR), R-squared (R 2 ), Kling-Gupta efficiency (KGE), and percentage of bias (PBIAS) [89,91,92] are provided in five packages of the R platform for the functionality and precision of the models.

LS Susceptibility Prediction Maps
All three ML models predicted very low LS susceptibility over most of the study area (Table 2; Figure 6), with RF = 58.13%, KNN = 55.54%, and CART = 86.805 of the total land area ( Table 2). For example, in the KNN model, the area of LS susceptibility classes was estimated to be 3.33, 1.08, 6.73, 33.32, and 55.54% of the region's total area for very low to very high LS susceptibility, respectively (Table 2). Based on the results from the RF, KNN, and CART models, the highest LS susceptibility values were obtained in the western region ( Figure 6).

Most Determinant LS Susceptibility Variables
For natural resource management, it is necessary to determine the most important variables (i.e., predictors) in the susceptibility to LS. Brown and Nicholls [93] highlighted that analyzing the relationships between LS and the most influential variables facilitates managers to focus on the most impactful human activities that affect LS. Subsidence-induced faulting is caused by groundwater extraction. However, there are few examples of how to determine the parameters that cause surface faults, and direct measurements are rare [94]. It has become increasingly helpful for decision-makers to use ML algorithms to gain new insights into the relationship between influential variables and LS occurrences. ML is now considered as a practical tool to improve environmental management practices in LS-prone regions [95]. To address the most determinant LS susceptibility variables, the jackknife test [43,78] was applied to evaluate the relative importance of each variable. For all three models, distance from a mine (DFM), distance from a syncline (DFS), distance from a dam (DFD), and distance from a fault (DFF) were identified as the most crucial input variables, as shown in Figure 7. The climate type, flow accumulation, and flow direction, which caused a minor increase in the modeling errors, are removed and ignored in the process (Figure 7). The type of model affects the relative importance of variables in LS modeling [96]. Thus, in one model, predictive variables may be highly important, while in another, they may not be so important. However, a consistent result across all three ML models was found in the current work. Furthermore, based on the results shown in Figure 7, the climate type, Topographic Wetness Index (TWI), flow direction, and flow accumulation indices had the most negligible impact on the occurrence of LS in the region (Figure 7).
Mining-induced catastrophic geohazards, such as LS, slowly occur around opencast mines due to the dewatering of mines [97,98]. As a result of mining activities, large areas suffer from LS, soil erosion, and waterlogging, which is detrimental to industrial and agricultural production [61,99]. He et al. [99] in China showed that the continuous operation of mines caused large-gradient LS, resulting in large-scale water accumulation in the deformation area. Similar findings have been found in the current study regarding the impact of the distance from mines (DFM) on LS. According to Loupasakis et al. [97], LS extended for 4 km around a mine in Northern Greece. Zhang et al. [70] concluded that the distribution of a salt field in China was highly correlated with the distribution of LS depression in their study area. These are similar findings to the results of the current study regarding the impact of the distance from synclines (DFS) on LS. We highlighted that the distance from dams (DFM) is one of the main factors influencing LS occurrence. Previous research showed that the construction of dams has significantly changed hydrological regimes [100] at local and regional scales, including substantially impacting LS susceptibility and occurrence [101]. For example, due to the building of dams, many river systems deliver much less sediment to flood plains [102]. A lack of adequate sediment supply leads to an uncompensated state of natural subsidence. The highest LS rate was observed downstream of the dams in central Iran [103]. The distance from faults (DFF) has been identified as the most critical variable in the occurrence of LS [56,[104][105][106][107][108], because landforms can be a direct result of the activities of faults and geological structures. It is common for faults to cause the Earth's crust to move, resulting in the subsidence of land [105]. Similarly, Bouwer [106] found that LS can occur when rigid slabs of alluvium rotate around buried faults. More LS occurs around the preexisting fault due to a weak interface within the soil body, and LS decreases as the distance to the fault increases [105]. Consistent with the work of Saro et al. [107], Shahabi et al. [108] considered the distance from laminates (faults) as the most critical variable in the occurrence of LS and stated that the DFF has a significant effect on the occurrence of LS. The present study showed that the leading cause of land LS in the study region is not due to groundwater withdrawals. Instead, the distance from dams (DFD) and the proximity to anticlines, faults, and mines (i.e., DFA, DFF, and DFM variables, respectively) are the most important identifiers of LS susceptibility [109]. However, it should not be overlooked that all the above indirectly affect local to regional groundwater processes.  Table 1.

Validation of LS Susceptibility Maps
The performance evaluation results of the RF, KNN, and CART models in predicting LS susceptibility indicated that all three models had relatively good results (NSE = 0.81-0.95 to 0.69-0.76 for training and validation phases, respectively; Table 3) (Table 3). Previous studies have shown that RF models perform better than CART models, which is consistent with this result [110][111][112]. Since none of these previous studies simultaneously applied these three ML models (RF, CART, and KNN) to LS prediction, a direct comparison of our results with theirs is difficult. However, a few studies have examined the performance of RF, CART, and KNN in other natural resources disciplines, including flood susceptibility [113], land-use and land-cover change [114], and ecological modeling [80], which showed similar trends relative to the ML models used in the current investigation and using the same (or similar) data set. The present work showed that a decision tree using several classification algorithms (i.e., RF) yields more reliable results than a single classification algorithm [115]. Because single models sometimes have bias, variance, and classification errors, in decision tree models using multiple single models (such as RF), the classification performance increases, and their variance and errors decrease [110]. In the present study, it was found that fitting multiple trees in the RF model was capable of overcoming one of the greatest disadvantages of single decision tree models, that is, low predictive performance [110,116,117]. According to Pham et al. [118], when surface and subsurface processes are complex, it is necessary to use more non-linearity rules to model geohazards (e.g., LS) in heterogeneous terrains, and models with greater flexibility perform better. Importantly, all three ML models demonstrated good performance, thus highlighting their overall capabilities in modeling LS using ML techniques, similar to past research [110]. Based on the map generated by the RF algorithm (Figure 6), the highest probability of LS susceptibility was found at distances less than 18 km from synclines, 6 to 13 km from anticlines, 23 km from dams, and distances less than 20 to more than 144 km from mines. Table 3. Performance statistics of machine learning models (random forests (RF), k-nearest neighbors (KNN), and classification and regression trees (CART)) in land subsidence (LS) susceptibility prediction.

Study Implications and Future Directions
In Iran, around 50% of the plains lack water, and there is likely to be continued pumping of water (legal or unauthorized), which could lead to LS [49]. Prevention of or reduction in LS susceptibility can be aided by robust, accurate, and meaningful spatial modeling. The results of the current research show that land subsidence in the study area dramatically contributes to the subsequent conversion of valuable ecological land into degraded non-producing land. This finding is consistent with the LS trends in other parts of Iran from the end of the last century to the beginning of the 21st century [24,119,120]. This awareness has increased as water agencies have increasingly released reports identifying the cumulative impacts of subsidence on significant infrastructure, resulting in hundreds of millions of dollars of damage. For example, the highest rate of LS was in 2019 at the rate of 8 to 10 cm per year in the southern and southwest parts of the study area and the lake shore ( Figure 5). The findings of the current investigation strongly support the need to include addressing subsidence as part of Iran's groundwater management requirements. In addition, while beyond the scope of the present study, future research should involve using the methods developed in the current work to create a framework to estimate LS with various techniques and apply it to a series of validation studies in different parts of Iran. This would facilitate the building of a set of comparable case studies that may allow the aggregation of environmental effects of LS in various parts of the country and the world. Finally, LS processes are dynamic, requiring the continuous updating of susceptibility zoning maps, such as the innovative processes presented here. With the aid of advanced InSAR analysis techniques that utilize newly acquired data, as well as GNSS stations, it will be possible to generate updated LS susceptibility and rate maps.

Conclusions
Environmental destruction is a consequence of land subsidence (LS). Hence, the identification, assessment, mapping, modeling, and management of LS are essential. In this study, we first estimated the exact amount of LS using an interferometric synthetic aperture radar (InSAR) technique. Then, the variables that affect LS susceptibility in the Bakhtegan basin were prioritized using three ML models (i.e., RF, KNN, and CART). The results showed that the RF model had the highest conformity and correlation coefficient among the models to predict the LS phenomenon. Based on the validation results of the models, the RF model had excellent predictive power and can be used to analyze geographical and environmental variables and plot the susceptibility of LS. The present study showed that the determinants and subsequent susceptibility of LS in the region of study are most highly related to the distance from a dam (DFD), distance from an anticline (DFA), distance from a fault (DFF), and distance from a mine (DFM). It was also shown that InSAR is a powerful tool for measuring LS in the drylands of Iran. In addition, the RF algorithm was shown to be an appropriate model to assist land managers, conservation authorities, natural resources stakeholders, and policy-makers in improving LS preventative practices and decision-making. While the ML methods used in this study have been previously applied in various fields, they are novel in the context of LS susceptibility prediction. The methodology presented in this study can serve as a template for others wishing to advance predictive confidence in LS susceptibility in other regions globally.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/land12040843/s1, Table S1: Descriptions of the lithology of the Bakhtegan basin. Reference [121] are cited in the Supplementary Materials.

Data Availability Statement:
The data that support the findings of this study are available from the first and second authors, upon reasonable request.