Deep-Learning-Based Automatic Sinkhole Recognition: Application to the Eastern Dead Sea

Abstract: Sinkholes can cause significant damage to infrastructure and agriculture, and can endanger lives, in active karst regions like the Dead Sea’s eastern shore at Ghor Al-Haditha. Common sinkhole mapping methods often require costly high-resolution data and manual, time-consuming expert analysis. This study introduces an efficient deep learning model designed to improve sinkhole mapping using accessible satellite imagery, which could enhance management practices related to sinkholes and other geohazards in evaporite karst regions. The developed AI system is centered around the U-Net architecture. The model was initially trained on a high-resolution drone dataset (0.1 m GSD, phase I), covering 250 sinkhole instances. Subsequently, it was fine-tuned on a larger dataset from a Pleiades Neo satellite image (0.3 m GSD, phase II) with 1038 instances. The training process involved an automated image-processing workflow and strategic layer freezing and unfreezing to adapt the model to different input scales and resolutions. We show the usefulness of initial layer features learned on drone data for the coarser, more readily available satellite inputs. The validation revealed high detection accuracy for sinkholes, with phase I achieving a recall of 96.79% and an F1 score of 97.08%, and phase II reaching a recall of 92.06% and an F1 score of 91.23%. These results confirm the model’s accuracy and its capability to maintain high performance across varying resolutions. Our findings highlight the potential of using RGB visual bands for sinkhole detection across different karst environments. This approach provides a scalable, cost-effective solution for continuous mapping, monitoring, and risk mitigation related to sinkhole hazards. The developed system is not limited to sinkholes, however, and can be naturally extended to other geohazards. Moreover, since it currently uses U-Net as a backbone, the system can be extended to incorporate super-resolution techniques, leveraging U-Net-based latent diffusion models to address the smaller-scale, ambiguous geo-structures that are often found in geoscientific data.


Introduction
Subsidence is a worldwide phenomenon of vertical ground settlement, due to either natural or anthropogenic causes [1]. A special form of subsidence is the appearance of enclosed depressions, so-called sinkholes, as a morphological landscape expression of karstified rock in the underground. Well-known examples of sinkholes are located in Florida, Turkey, Germany, China, Spain, at the Dead Sea and in many other karst environments worldwide (see e.g., [1,2]).
Sinkholes are a significant natural hazard, with the potential to cause extensive damage to the environment, infrastructure, and human life (see [3–5]). Thorough and accurate mapping of sinkholes is important for identifying patterns, monitoring sinkhole activity, and communicating the steps needed to prevent or mitigate damage; it also contributes to the creation of comprehensive sinkhole inventories. Researchers such as Galve et al. [6], Galve et al. [7], Sevil and Gutiérrez [8], and Gutiérrez [9] have highlighted the importance of these sinkhole maps in supporting decision-making processes for land use, development projects, and hazard preparedness in areas susceptible to sinkhole formation.
Over time, the methods used to map sinkhole outlines have evolved considerably, reflecting a shift in the types of data and technological approaches used. Initially, the dominant methods were on-site field assessments and geophysical surveys, along with manual inspection of topographic maps and stereo-images (e.g., [10,11]). An alternative source of baseline data is Digital Elevation Models (DEMs), which can be derived either from passive remote sensing data, such as aerial photography and satellite imagery (e.g., [12,13]), or from active remote sensing sources such as airborne laser scans (LiDAR, [14]) or radar, e.g., the Shuttle Radar Topography Mission (SRTM, [15]). In the last 40 years, the increased availability of such DEMs has facilitated the development of many automated methods of depression mapping, often performed within a Geographic Information System (GIS). These methods tend to leverage the geometric properties of sinkholes to identify them, typically by simulating the inundation of the terrain model with water (Figure 1). Despite the undoubted increase in efficiency and objectivity afforded by these approaches, they require up-to-date DEMs of sufficient resolution to be obtained every time a sinkhole inventory is to be updated, which can be extremely costly and time-consuming. Furthermore, the mathematical generalization of sinkhole geometry necessary to apply these 'top-down' approaches can be inflexible, and can result in large numbers of depressions being missed, especially in the case of subtle and shallow sinkhole morphologies, or nested systems. Please refer to [1] (Sect. 6.3.7) for a thorough review of sinkhole mapping techniques.
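To make the idea of simulated inundation concrete, the following is a minimal sketch of a priority-queue depression fill on a small gridded DEM. It is an illustration written for this text, not the implementation used in any of the cited studies; the function names `priority_flood_fill` and `depression_depth`, the 8-neighbour connectivity, and the assumption that water can drain freely off the map border are choices made here.

```python
import heapq


def priority_flood_fill(dem):
    """Return a copy of `dem` in which every enclosed depression is raised to its spill level."""
    rows, cols = len(dem), len(dem[0])
    filled = [list(row) for row in dem]
    visited = [[False] * cols for _ in range(rows)]
    heap = []
    # Seed the queue with all border cells (assumption: water drains off the map edge).
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (filled[r][c], r, c))
                visited[r][c] = True
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    # Grow inwards, always from the lowest cell reached so far.
    while heap:
        level, r, c = heapq.heappop(heap)
        for dr, dc in neighbours:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not visited[nr][nc]:
                visited[nr][nc] = True
                # A cell lower than the current water level gets flooded up to that level.
                filled[nr][nc] = max(filled[nr][nc], level)
                heapq.heappush(heap, (filled[nr][nc], nr, nc))
    return filled


def depression_depth(dem):
    """Depth of simulated water per cell; values above zero mark candidate depressions."""
    filled = priority_flood_fill(dem)
    return [[f - d for f, d in zip(f_row, d_row)] for f_row, d_row in zip(filled, dem)]
```

Cells with a positive depth form the candidate depressions. A sketch like this also shows why such methods depend entirely on the DEM: anything not expressed in the elevation grid cannot be detected.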
In the last 20 years, however, the combined advances in image processing and statistical optimization have facilitated an explosion in 'data-driven' automatic image classification, e.g., training deep convolutional neural network architectures to recognise data patterns, among other methods (e.g., [16–20]). After manually labelling a subset of the total population of studied features, training data is extracted and turned into image patches of fixed size. The model then identifies common patterns among the training data, such as edges of various sizes and orientations. This is completed in several stages/layers, with the dimensions of input data being reduced between stages and more abstract and complex patterns identified at each stage. By analysing the statistical relationships between patterns, the model is then able to generalize and classify data it has not seen before. The exact architecture of a model can be adjusted according to the specific task at hand. The intrinsic nature of these models allows them to be far more scalable and efficient than explicit categorisation models, such as the approaches presented in Figure 1. Such machine learning frameworks have proved to be applicable to the detection and mapping of sinkholes (Table 1), and their versatility and adaptability have been shown across many other object mapping applications.
While previous studies have significantly advanced the field of sinkhole detection using machine learning models, a review of these works reveals specific methodological challenges. For instance, limitations in relation to data availability (especially high-resolution elevation models), the need for manual verification to ensure accuracy, a lack of testing outside of limestone karst regions, and resolution limitations that may not fully capture the diverse geometrical characteristics of sinkhole instances, have been noted across different approaches (see Table 1 for detailed limitations of each referenced study). These limitations underline the importance of developing more adaptable and efficient methodologies for sinkhole detection.
Figure 1. (A) The method of [21], which maps the depressions according to simulated stratification of water within them. Adapted with permission from [21], 2024, Elsevier. (B) The 'D8' method of Jenson and Domingue [22], which uses a moving window to map the watersheds within the depression. This method and that shown in (A) become very computationally intensive with high-resolution data. (C) The 'priority fill' method of Wang and Liu [23], which is able to simulate filling of the entire compound depression in one pass of processing. This method offers an improvement in run-time of a factor of 30 on (B), but is not able to capture the internal complexity of the compound depression. Adapted with permission from [23], 2024, Taylor & Francis. (D) The 'contour tree' method developed by Wu et al. [24,25], which builds on the 'priority fill' method to produce a graph ('tree') of contours within the compound depression, allowing nested depressions to be identified and labelled by their rank. This allows for more accurate automated updating of depression location and morphometric databases. The method has since been further refined for efficient computation (see [26,27]). Adapted with permission from [25], 2024, Elsevier.
Our research introduces an approach to sinkhole mapping that uses RGB visual bands only as the data source. This method aims to overcome limitations highlighted in earlier studies by using a modified deep learning pipeline. Our main objective is to develop an automated model capable of accurately mapping the spatial distribution of sinkhole instances by analysing aerial images of different resolutions. This was supported by the use of labelled sinkhole data from the evaporite karst region of Ghor Al-Haditha, situated on the eastern shore of the Dead Sea. The project was performed in two distinct phases. In the initial phase, a U-Net model was trained, tested, and validated on a dataset of 250 instances derived from a high-resolution orthophoto captured in December 2016, featuring a ground sample distance (GSD) of 0.1 m. The subsequent phase involved transferring the model to a new dataset comprising 1038 instances, obtained from a Pleiades Neo satellite image (GSD of 0.3 m) from August 2022, capturing the same study area. The effectiveness of our algorithm was demonstrated through its accuracy in detecting sinkhole instances across both datasets, underlining the model's transferability and the feasibility of automating sinkhole mapping using readily available satellite images. The code is made publicly available at: https://github.com/ducspe/sinkhole_geohazard_segmentation (accessed on 20 June 2024).

Dead Sea Site Description and Sinkhole Evolution
Sinkhole formation in the Dead Sea region has intensified over the last 35 years, with over 6000 sinkholes having formed, a development closely associated with the rapid regression of the lake and the associated shoreline migration [34–36]. Several conceptual models have been proposed on the origin and evolution of the sinkholes. Notably, different geoscientific methods have revealed that the underlying conditions for sinkhole formation vary from location to location along the Dead Sea shoreline. On the eastern shoreline, where our data subset stems from, and parts of the SW shoreline, physical subsurface erosion and chemical dissolution of evaporites, the general instability of non-evaporitic sedimentary materials, and tectonic control have all been suggested as underlying causes for this phenomenon [35,37–39]. For the majority of the western shoreline, however, tectonic control, a purely chemical dissolution of a massive salt layer and related salt-dissolution front migration have been suggested [40–44].
The selected site for this study, Ghor Al-Haditha, is situated on the south-eastern shore of the Dead Sea in Jordan (Figure 2). The site encompasses approximately 9.75 km² and lies 340 to 440 m below mean sea level, bordered by the Dead Sea to the west and the Dead Sea highway to the east. Despite its small size, the site has a high density of sinkholes, with over 1000 sinkholes having formed from 1967 to 2017 [36]. The sinkholes at Ghor Al-Haditha are formed in three primary near-surface materials: unconsolidated to semi-consolidated lacustrine silty-clay carbonates, alluvial sand-gravel sediments, and rock salt with interleaved thin mud layers [45]. Sinkhole morphology varies depending on the mixture of materials in which the sinkholes form. Sinkholes formed primarily in alluvium and salt materials generally have a high depth-to-diameter ratio [36], indicating a collapse origin. On the other hand, sinkholes formed mostly in mud tend to be much wider and shallower. They are formed by surface sagging of overlying deposits, and are typically filled with large collapsed and inward-rotating chunks of sedimentary material [45,46]. Sinkhole clustering and coalescence into larger compound sinkholes and larger-scale karst depressions are common processes. The smallest features included in the dataset are ~3 m in diameter, while the largest are over 60 m. In addition to these enclosed depressions, several surface stream-channels fed by groundwater springs have formed in the former lakebed of the Dead Sea [37]. These canyons are characterised by steep bank slopes, and the springs feeding the streams often emerge within areas of subsidence and sinkhole formation.
Overall, despite the lack of solutional karst features, the dataset we have gathered encompasses a wide variety of evaporite sinkhole materials, morphologies and genesis types. The sparse vegetation and clear skies that are typical for the region amplify the visibility of sinkholes in aerial imagery. The dynamic topography, characterized by three major wadi systems depositing alluvial fan deposits on the coastline [37], adds diversity to the training dataset. In this study, we aim to capitalize on these diverse characteristics to improve our models' performance in identifying sinkholes across different environments.
Sinkhole development in the Ghor Al-Haditha area has been rampant since 1986, with over 1000 sinkholes appearing between 1967 and 2017 [36], and more than 1000 further sinkholes forming between 2017 and 2024 [47]. The incessant formation of sinkholes has resulted in significant damage, disrupting infrastructure and affecting agriculture in the area [36]. In response to these challenges, the Ministry of Energy and Mineral Resources of Jordan has conducted geologic and geophysical surveys in this area since the early 1990s, aiming to understand the causes and consequences of the sinkhole formation (e.g., [48]).
As the Dead Sea evolves as a potential site for geotourism, the careful identification and mapping of sinkholes becomes very important [49,50]. This effort goes beyond immediate safety concerns, offering a route to revive local economies impacted by sinkhole formation. Improving the precision with which these features are detected and tracked supports the safety standards of the region for geotourism, protecting both visitors and local communities. Proactive identification can not only prevent significant economic losses from infrastructure damage, but can also offer valuable insights into the future trajectory of sinkhole formation. A comprehensive grasp of current sinkhole patterns therefore becomes useful in developing informed prevention and response strategies, ensuring the Dead Sea's viability as a geotourism hotspot without compromising on safety or environmental integrity [49,50].
Various studies, deploying geological, geophysical and hydrogeological surveys, remote sensing, and numerical simulation, have been undertaken at Ghor Al-Haditha to comprehend the spatio-temporal development and the mechanisms of sinkhole formation here (e.g., [51]). Through these studies, local authorities have been able to delineate areas susceptible to sinkhole threats more effectively. Furthermore, aerial images of different resolutions collected over the years by satellites, as well as balloon and drone surveys, offer a chronological illustration of sinkhole evolution in the region [47]. Given the dynamic geology of the region, these aerial images form a large and diverse training dataset for our deep learning model.

Deep Learning Approach
We chose to frame the research problem of mapping and delineating sinkholes as an instance segmentation problem (Figure 3), enabling the classification of sinkhole instances at the pixel level. This approach offers advantages over simple classification or object detection methods by facilitating detailed spatio-temporal morphometric analysis and evolution monitoring of the mapped sinkhole instances. Moreover, it allows for clear delineation of 'redundant' and 'non-redundant' sinkholes (see Sevil and Gutiérrez [52] for a recent example of this). The model development and training process relied exclusively on the colour channels (RGB) present in the aerial data, as opposed to incorporating other channels like Digital Surface Models (DSMs), which might not be available for all regions. This methodology makes the model applicable to a broader range of cases.

Datasets and Annotation Process
The study was conducted in two phases, employing two different datasets: high-resolution (HR) drone orthophoto imagery for Phase 1 and low-resolution (LR) optical satellite imagery for Phase 2. The first dataset was compiled from a point cloud collection of high-resolution drone images, which were processed using structure-from-motion photogrammetry (see [35] for an overview of this process), and the second dataset was generated from a single Pleiades Neo scene acquired in August 2022. Both datasets are taken from the same region shown in Figure 2 above. Notably, the satellite imagery covers a larger area that includes the region covered by the drone. The description of both datasets and the process of their annotation are elaborated below.

Dataset for Phase 1 (HR Drone Images)
For the initial phase of the study, we employed a high-resolution dataset that had been gathered in December 2016 through drone-based, close-range aerial surveys. This dataset comprises optical orthophoto mosaics with a resolution of 0.1 m/pixel, acquired via a 12 MP DJI Phantom 3 inbuilt camera at an altitude of around 100 m. The manual annotation process was guided by a digital surface model (DSM) derived by photogrammetric processing of the optical images. For a more detailed explanation regarding the creation of orthophoto mosaics and DSMs, refer to Al-Halbouni et al. [35] and Watson et al. [36]. The dataset was explicitly annotated for the purposes of this research project to train a deep learning model. In this phase, particular emphasis was placed on the precision and quality of the annotation process, prioritizing the accuracy of labelled data over its quantity.
The annotation process involved manually digitizing sinkhole extents within the ArcGIS Pro V. 2.9 software, employing various layers and tools within the software to guide the annotation process (Figure 4). In this way, we created a sinkhole instance segmentation mask image where each sinkhole was designated with a distinct colour (Figure 5). Expert knowledge of the distinction between an enclosed sinkhole and an open stream-channel sink has been incorporated at this stage. Finally, the mask image was exported as a TIFF RGB image of the same dimensions as the orthophoto image (12,633 × 15,062 × 3).
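A colour-coded mask of this kind can be turned into integer instance identifiers and a binary sinkhole mask with a few array operations. The sketch below is illustrative only: the function name `colour_mask_to_instance_ids` is hypothetical, and it assumes an 8-bit RGB array in which the background is pure black and every sinkhole has its own unique colour.

```python
import numpy as np


def colour_mask_to_instance_ids(mask_rgb):
    """Map each distinct non-black colour of an (H, W, 3) uint8 mask to an instance id."""
    mask = np.asarray(mask_rgb).astype(np.int64)
    # Pack the three 8-bit channels into one integer code per pixel.
    codes = (mask[..., 0] << 16) | (mask[..., 1] << 8) | mask[..., 2]
    unique_codes, inverse = np.unique(codes.ravel(), return_inverse=True)
    # Assumption: code 0 (pure black) is background and keeps id 0.
    ids = np.zeros(len(unique_codes), dtype=np.int32)
    is_sinkhole = unique_codes != 0
    ids[is_sinkhole] = np.arange(1, is_sinkhole.sum() + 1)
    return ids[inverse].reshape(codes.shape)


# Tiny synthetic example with two differently coloured sinkholes.
example = np.zeros((4, 6, 3), dtype=np.uint8)
example[0:2, 0:2] = (255, 0, 0)
example[2:4, 3:6] = (0, 128, 255)
instance_ids = colour_mask_to_instance_ids(example)
binary_mask = instance_ids > 0  # two-class (Background / Sinkhole) mask
```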

Dataset for Phase 2 (LR Satellite Images)
In the subsequent phase of our research, the focus shifted towards exploring the potential of transfer learning [53] to enrich the versatility of our model. To this end, we adapted the model that was initially tailored for drone data to suit satellite imagery. This transition leveraged a dataset curated from a collection of pre-existing datasets, originating from research studies conducted on satellite images from the year 2022. The images stem from the Pleiades Neo satellite, with a resolution of 0.3 m/pixel, and were acquired in August 2022 and pan-sharpened. The annotation process for sinkhole instances was guided by the central points of these sinkhole instances present within the original dataset. Using ArcGIS Pro, we mapped the extent of each sinkhole as a polygon.
The annotation of the satellite images was aided by the streaming tool in the ArcGIS software, a convenient feature that allows users to draw polygons following the movements of the computer mouse. Several defining characteristics of the sinkholes assisted in the annotation process. These included pronounced shadowing typically observed in the southern corner, noticeable alterations in texture and colouration, a discernible bright salt layer, and occasionally, water accumulation at the sinkholes' depocentres, as depicted in Figure 6. Annotation limitations primarily stemmed from the lower resolution of satellite images (in comparison with the drone case) and the absence of elevation data to guide the process.

Annotation Special Cases
During the annotation process, we encountered a few unique scenarios. For instance, where vegetation obscured parts of a sinkhole, making the borders not entirely visible, an estimation method was employed for the high-resolution drone dataset. In such cases, sinkholes that were predominantly concealed by vegetation were not mapped. On the other hand, in the low-resolution satellite dataset, the annotator resorted to satellite images from previous years to estimate the borders of obscured sinkholes. This situation was not frequent, affecting only approximately 1 to 10 sinkholes per image, due to the sparse vegetation in the Dead Sea region.
Another unique case involved compound (merged) sinkholes. These were treated differently between the two datasets: in the high-resolution dataset, each sinkhole within a compound structure received a separate annotation, while in the low-resolution dataset, compound sinkholes were consistently mapped as a single unit. The subsequent section will focus on the data preprocessing and training methodology for the deep learning model.

Deep Learning Model Architecture
The choice of an appropriate CNN architecture is important in achieving the objectives of our sinkhole recognition project. In this study, our aim is to identify individual instances of sinkholes, a task known as instance segmentation. This poses a challenge, particularly given the constraints of our limited dataset. To address this, we selected the U-Net architecture [54], which, despite its typical association with semantic segmentation, presents a viable solution for our requirements. The U-Net architecture was deliberately chosen for several reasons:

• Simplifying intermediary steps: U-Net generates semantic segmentation maps that serve as simplifying intermediary steps in our pipeline, followed by post-processing operations like connected-component labelling (CCL) [55] to generate the instance segmentation map. This two-step approach reduces the complexity of the problem, allowing for more accurate segmentation despite limited data.

• Adaptability to limited datasets: U-Net is particularly adept at handling limited datasets due to its efficient structure. The fully convolutional nature of U-Net allows it to perform well even with relatively small amounts of training data, which was crucial considering the limited number of annotated sinkhole instances available for our study.

• Multiscale feature extraction: U-Net's architecture, with its encoder-decoder structure and skip connections, allows it to capture multiscale features effectively [56]. This is advantageous for detailed sinkhole identification, as it enables the network to retain high-resolution information while also learning more abstract representations. In certain scenarios, the skip connections can also help manage class imbalance challenges, commonly encountered in image segmentation tasks, as they facilitate the retention of high-resolution information, important for accurately depicting smaller-scale, minority classes [54].

• Scalability: U-Net is known for its scalability and efficiency in processing large datasets. Even if the datasets grow substantially with more drone and satellite data being accumulated over time, U-Net's fully convolutional architecture can keep pace with the increased scale and is amenable to efficient parallel processing in hardware. The fully convolutional nature also allows the network to handle various input sizes seamlessly, such as the ones we experimented with (128 × 128 and 256 × 256), as well as other shapes that may arise in the future due to our focus on multi-resolution aspects.

• Strategic goals: Additionally, the U-Net architecture fits well within our strategic goal of developing a multi-scale, multi-resolution sinkhole detection system. Given the potential for future integration of super-resolution techniques via latent diffusion models such as SR3, which is a U-Net-based super-resolution diffusion model [57], U-Net provides a robust foundation that can be expanded upon. It therefore acts as a backbone and allows heterogeneous components, i.e., segmentation and super-resolution modules, to be connected in a consistent manner. Adding such super-resolution techniques can help improve the detection of tightly spaced geological features, for example, around the edge areas of merged sinkholes, something we encountered issues with in this work and would like to address next.
We also considered other networks, such as Mask R-CNN [58] and Cascade R-CNN [59]. These architectures are specifically designed for instance segmentation and could potentially handle the task end-to-end. However, they are more computationally intensive, because they must isolate the individual instances rather than only segment the image semantically. This is to be expected, since instance segmentation is generally a harder task than semantic segmentation. Furthermore, we do not have enough depictions of the same sinkholes from various perspectives to be able to train them properly.
We would also like to mention the recent advancements in segmentation foundation models, such as Segment Anything [60], which generally make use of the Vision Transformer (ViT) architecture [61]. Unfortunately, they do not work well with geological data, possibly because of the statistics of the data distributions they were originally trained on, amongst other things. Strategically, however, we also did not choose a transformer architecture on our end, because it is attention-based and is therefore much more data-demanding than traditional convolution-based models.
Given the data constraints, U-Net was more suitable for our needs, allowing us to reduce the complexity of the task by first performing semantic segmentation and then performing instance segmentation as a second step, thereby building on the abstraction principle, i.e., the intermediary maps provided by the U-Net. In addition, U-Net is also more universal, allowing us to reuse its latent space embeddings in a consistent manner, namely in the super-resolution extension we are planning via U-Net-based diffusion models, which would hopefully address the lower edge segmentation scores we are currently facing.
Considering all the above points, we specifically chose the U-Net architecture because it fits our expectations in terms of intended use-case, computational complexity, and consistency with future development plans and features that we intend to try out and possibly incorporate into the broader geohazard detection system.
To ensure that our system effectively identifies each sinkhole instance, we integrated a post-clustering algorithm into our methodology. However, this method faced limitations in differentiating compound sinkholes specifically, often classifying them as single instances. To address this, we experimented with adding a third class in the segmentation process to represent the edges between merged sinkhole instances. The idea is to predict where the edges between adjoining sinkholes lie, subtract those pixels so that the sinkholes are separated first, and then apply the clustering algorithm to the separated sinkholes.
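The edge-subtraction idea can be sketched with connected-component labelling on a predicted class map. This is a minimal illustration rather than the authors' post-processing code: the class indices, the function name `semantic_to_instances`, and the use of `scipy.ndimage.label` with its default 4-connectivity are assumptions made here.

```python
import numpy as np
from scipy import ndimage

# Assumed class indices for the three-class prediction.
BACKGROUND, SINKHOLE, EDGE = 0, 1, 2


def semantic_to_instances(class_map, remove_edges=True):
    """Turn a per-pixel class map into an instance-id map via connected components."""
    class_map = np.asarray(class_map)
    if remove_edges:
        # Drop predicted edge pixels so that merged sinkholes fall apart.
        foreground = class_map == SINKHOLE
    else:
        # Treat edge pixels as sinkhole: merged sinkholes stay one component.
        foreground = class_map != BACKGROUND
    instance_ids, count = ndimage.label(foreground)
    return instance_ids, count


# Two sinkholes joined by a one-pixel-wide predicted edge.
prediction = np.zeros((5, 9), dtype=np.uint8)
prediction[1:4, 1:4] = SINKHOLE
prediction[1:4, 4] = EDGE
prediction[1:4, 5:8] = SINKHOLE
_, separated = semantic_to_instances(prediction, remove_edges=True)
_, merged = semantic_to_instances(prediction, remove_edges=False)
```

With the edge pixels removed, the example yields two components; without that step the same pixels form a single compound instance, which is the failure mode described above.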
Our preprocessing phase included a multistep procedure to incorporate this edge class effectively. Initially, we employed a customized Sobel filter to detect edges between compound sinkholes in the mask image. The formula used for the filtered edge image was:
$$I = \sqrt{\left(\frac{\partial f}{\partial x}\right)^{2} + \left(\frac{\partial f}{\partial y}\right)^{2}}$$
where I is the intensity of the pixels in the edge image and f is the 2D function depicting the original RGB label image. We apply this formula on the label image to detect where there is a sharp transition between the pixel values of one sinkhole and another sinkhole. This effectively means we detect the boundary between two merged sinkholes. The pixel intensity of this boundary is the magnitude of the label image gradient, and the components of this gradient are the derivatives/sharp transitions in the x (horizontal) and y (vertical) directions of the image. Note that we apply this formula selectively, such that we do not detect edges between the sinkholes and the black background, but rather only between merged sinkholes. We do this by scanning the image to see where black background is present and ignoring those patches, i.e., not applying the formula there.
Once we have the thin edges computed in this manner, we apply dilation, a morphological computer vision operation, to widen the edges to a certain extent. A dilated edge has the interpretation of a region of uncertainty, encoding the ambiguity, even for experts, regarding the question: where exactly does one sinkhole end and the other one begin? The model then has the chance to encode the class uncertainties in its final layer, and to become more or less uncertain depending on the different data examples it sees. The dilated edges are finally overlaid onto the binary label image to create a 3-class label image: 'Sinkhole' class, 'Background' or 'Non-sinkhole' class, and 'Edge' class between sinkholes (Figure 7). Subsequently, this finalized label image is ready to be patched and used to train the U-Net.
In the final phase of data preparation, we segmented the large images from the original orthophoto and the associated 3-class label into smaller, equally-sized tiles using a sliding window method with a 50% shifting/pixel overlap. This overlapped tiling allowed us to generate more data and hence facilitated the training of a more accurate model. The labelled images were then divided into training, validation, and testing sets in an 80:10:10 ratio, ensuring a comprehensive evaluation of the model's performance.
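The label preparation and tiling steps can be sketched as follows. This is an assumption-laden illustration, not the published pipeline: it works on an integer instance-id map instead of the RGB label image, uses the standard `scipy.ndimage.sobel` filter in place of the customized one, picks an arbitrary dilation of one iteration, and keeps edge pixels inside the sinkhole area. The names `three_class_mask` and `tile_with_overlap` are hypothetical.

```python
import numpy as np
from scipy import ndimage

BACKGROUND, SINKHOLE, EDGE = 0, 1, 2  # assumed class indices


def three_class_mask(instance_ids, dilation_iters=1):
    """Build a Background/Sinkhole/Edge label image from an instance-id map (0 = background)."""
    f = instance_ids.astype(np.float64)
    # Gradient magnitude of the label image: non-zero where label values change.
    magnitude = np.hypot(ndimage.sobel(f, axis=1), ndimage.sobel(f, axis=0))
    # Ignore 3x3 neighbourhoods that contain background, so that only
    # transitions between two touching sinkholes remain.
    no_background_nearby = ndimage.minimum_filter(instance_ids, size=3) > 0
    thin_edges = (magnitude > 0) & no_background_nearby
    # Widen the thin edges into a band of uncertainty.
    wide_edges = ndimage.binary_dilation(thin_edges, iterations=dilation_iters)
    labels = np.where(instance_ids > 0, SINKHOLE, BACKGROUND).astype(np.uint8)
    labels[wide_edges & (instance_ids > 0)] = EDGE  # assumption: edges stay inside sinkholes
    return labels


def tile_with_overlap(image, tile=128, overlap=0.5):
    """Cut an (H, W[, C]) array into square tiles with the given fractional overlap."""
    stride = int(tile * (1 - overlap))
    height, width = image.shape[:2]
    return [
        image[r:r + tile, c:c + tile]
        for r in range(0, height - tile + 1, stride)
        for c in range(0, width - tile + 1, stride)
    ]


# Two touching sinkholes (ids 1 and 2) on a background of zeros.
ids = np.zeros((8, 12), dtype=np.int32)
ids[2:6, 2:6] = 1
ids[2:6, 6:10] = 2
labels = three_class_mask(ids)
tiles = tile_with_overlap(np.zeros((256, 256, 3), dtype=np.uint8))
```

In this sketch, pixels at the image border that do not fill a complete tile are simply dropped; padding the image first would be one way to keep them.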
Our adaptation of U-Net was further refined to address the class imbalance challenge, a common issue in image segmentation. We employed data augmentation techniques and also experimented with specialized loss functions to see whether they help balance the representation of different classes. Figure 8 illustrates the developed methodology for sinkhole instance segmentation.
The image tiles in our study are processed through a U-Net implemented in PyTorch Lightning. The network begins with a double convolution block that extracts basic features such as edges at different angles. This initial block consists of a convolutional layer, batch normalization, and a ReLU nonlinearity, repeated twice; batch normalization helps to decouple the convolutional layers for better convergence during training. Following this initial stage, the encoder comprises four down-sampling stages, each consisting of a max-pooling layer and a double convolution. This structure progressively learns more abstract features, with input/output channel tuples of (64, 128), (128, 256), (256, 512), and finally (512, 1024). The encoder's compressive path is mirrored by an expanding decoder path with four up-sampling stages, each consisting of a transposed 2D convolution followed by a double convolution block. The channel tuples in these stages reverse the encoder's pattern, decreasing from (1024, 512) to (128, 64). In the last layer, adapted for our ternary segmentation task, a 2D convolution layer aggregates the 64 channels into three: one for the background, one for edges between sinkholes, and one for the sinkholes themselves. We kept the skip connections between the encoder and decoder, which give the decoder direct access to detailed information from the encoder. Our U-Net model is flexible with respect to input size, but for this study we focused on 128 × 128 image patches [54].
The workflow begins with pre-processing the mask image (STEP 1) to detect edges between sinkholes, transforming the original two-class mask (Background and Sinkhole) into a three-class mask (Background, Sinkhole, and Edge). The input RGB orthophoto and the generated three-class mask are then used to train the multi-class U-Net model (STEP 2). The best-trained model is then applied to segment the full orthophoto, generating a semantically segmented mask (STEP 3). This mask undergoes a post-processing step (STEP 4) to generate the final instance segmentation mask image.
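The architecture described above can be sketched in plain PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `DoubleConv` and `UNet` are hypothetical, the 1 × 1 kernel of the final layer is the conventional U-Net choice rather than something the text specifies, and the PyTorch Lightning training wrapper is omitted.

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    """(3x3 conv -> batch norm -> ReLU) applied twice; padding keeps H and W."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class UNet(nn.Module):
    """Hypothetical sketch: four down-sampling and four up-sampling stages."""

    def __init__(self, in_channels: int = 3, num_classes: int = 3):
        super().__init__()
        self.inc = DoubleConv(in_channels, 64)  # initial double convolution
        enc = [(64, 128), (128, 256), (256, 512), (512, 1024)]
        dec = [(1024, 512), (512, 256), (256, 128), (128, 64)]
        # Encoder stage: max-pooling followed by a double convolution.
        self.down = nn.ModuleList(
            [nn.Sequential(nn.MaxPool2d(2), DoubleConv(i, o)) for i, o in enc]
        )
        # Decoder stage: transposed convolution, then a double convolution
        # applied to the concatenation of skip features and up-sampled features.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(i, o, kernel_size=2, stride=2) for i, o in dec]
        )
        self.up_conv = nn.ModuleList([DoubleConv(i, o) for i, o in dec])
        # Assumption: 1x1 output convolution mapping 64 channels to class logits.
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        skips = [self.inc(x)]
        for stage in self.down:
            skips.append(stage(skips[-1]))
        x = skips.pop()  # bottleneck features (1024 channels)
        for up, conv in zip(self.up, self.up_conv):
            x = up(x)
            x = conv(torch.cat([skips.pop(), x], dim=1))  # skip connection
        return self.head(x)  # per-pixel logits: background, edge, sinkhole
```

Because every stage halves or doubles the spatial size, any input whose height and width are multiples of 16 (such as the 128 × 128 patches used here) returns logits of the same spatial size.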

Transition from Higher- to Lower-Resolution Satellite Imagery
Our research pivots on the use of high-resolution drone images in the first phase. Owing to their level of detail, these images allowed for intricate mapping, annotation, and sinkhole detection. Training our deep learning models on this dataset ensured a robust understanding of sinkhole morphologies, their varied appearances across different terrains, and the fine details that separate them from the surrounding landscape.
In the second phase, our research confronted a particular challenge for the field: how can the knowledge acquired from high-resolution images be leveraged when only lower-resolution satellite data are available? For this part, we turned to satellite images from the year 2022, which inherently lack the detail present in the drone samples. The transition involved several key modifications, which are listed below.

Addressing Combined Sinkholes
In the first phase, which employed high-resolution drone images, the distinction between combined (merged) sinkholes was prioritized for several reasons. Foremost, a clear understanding of individual sinkhole boundaries is pivotal for advanced sinkhole hazard mitigation and monitoring efforts. This demarcation helps in comprehending sinkhole merging patterns, which is useful for nuanced decision-making within sinkhole management activities. Training with the additional 'Edge' class broadens the model's exposure and is an extra step towards generalization. Delineation between merging sinkholes also supports more accurate tracking and offers insights into sinkhole growth and potential future developments.
However, as the study transitioned to low-resolution satellite images in the second phase, adjustments were imperative. The reduced granularity of these images constrains the discernment of boundaries between closely clustered sinkholes. Recognizing such clusters as a unified instance was therefore more accurate and avoided data extraction errors. This approach also aligns better with practical scenarios where the overarching objective is to identify a broader hazardous area rather than discrete sinkholes. In addition, because the satellite annotations were guided by centre points from the high-resolution dataset, attempting to define boundaries within clustered sinkholes could jeopardize annotation consistency. Lastly, considering the limited number of combined sinkholes in the first place, recognizing them as singular instances alleviated the data imbalance issue, ensuring a more adequate dataset for model training.

Modifications in Data Pre- and Post-Processing
Transitioning from high-resolution drone to low-resolution satellite images required some pre-processing steps to be modified. For the satellite imagery, histogram equalization was applied to enhance image contrast, and mean subtraction was then applied to centre the pixel values, preparing the data for transfer learning. An important difference to reiterate is that the drone-based pre-processing method put an emphasis on identifying combined sinkhole boundaries, employing edge detection and dilation techniques to label transitions between sinkholes. In contrast, the satellite case, in line with the decision to treat combined sinkholes as singular entities, omitted these steps, accommodating the limitations of the lower resolution and the goal of robust annotations. Thus, only two classes were used in this approach: the 'Sinkhole' class and the 'Background' or 'Nonsinkhole' class.
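The two satellite pre-processing operations named above can be illustrated with NumPy alone. The sketch below is only an approximation under explicit assumptions: the text does not say whether equalization was applied per RGB channel or on a luminance channel, nor whether the subtracted mean is per image or per dataset, so per-channel equalization and a per-image, per-channel mean are assumed here. The function names are hypothetical.

```python
import numpy as np


def equalize_channel(channel: np.ndarray) -> np.ndarray:
    """Histogram-equalize one uint8 channel through its cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # cumulative count at the darkest value
    denom = cdf[-1] - cdf_min
    if denom == 0:                     # constant channel: nothing to stretch
        return channel.copy()
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[channel]                # apply the look-up table pixel-wise


def preprocess_satellite_tile(image: np.ndarray) -> np.ndarray:
    """Equalize each channel, then subtract the per-channel mean (assumed)."""
    equalized = np.stack(
        [equalize_channel(image[..., c]) for c in range(image.shape[-1])],
        axis=-1,
    ).astype(np.float32)
    return equalized - equalized.mean(axis=(0, 1), keepdims=True)
```

After this step each channel spans the full 0 to 255 range before centring and has approximately zero mean afterwards.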

Transfer Learning and Freezing of Certain U-Net Layers for the Satellite Case
In transitioning to satellite imagery, we applied transfer learning by initializing our satellite experiment models with the best weights obtained from the drone experiments and continuing training with satellite data. This approach involved strategic decisions on which layers to freeze and which to fine-tune. The key scenarios were as follows:
Freezing Initial Encoder Layers: By freezing the early layers, we took advantage of the recognition capability of basic features, e.g., patterns and textures with various angles learned in the drone training phase. We assume that these fundamental features are generally transferable and useful across different datasets.
Freezing Half of the Encoder Layers: This strategy extends beyond basic features, transferring more complex feature combinations learned from the drone data. We assume that the effectiveness of this method varies, as these complex features may or may not be as relevant for satellite data.
Freezing the Entire Encoder: Here, only the decoder was fine-tuned. We anticipated potential limitations, since the encoder's ability to adapt to the complex, special features of the satellite dataset was restricted.
Unfreezing the Entire Encoder: This scenario entailed training on satellite data with all layers of the U-Net, including both the encoder and decoder. This approach allows for comprehensive fine-tuning using the new data, benefiting from the efficient starting point provided by the drone-trained weights. A good starting point for the weights also ensures quicker convergence to an optimal set of weights for the satellite dataset. Although this method allows the model to learn new features from the satellite data, it may lead to some loss of previously learned information from the drone data. In exchange for this partial loss, the model can adapt more flexibly to the new dataset while still benefiting from good initialization points. The loss can be reduced to some extent by choosing a more gradual re-training process with smaller learning rates. Comparing the results from the 'Unfreezing the Entire Encoder' experiment to the other experiments provides valuable empirical insights into the trade-offs between the potential risks and the gained benefits of this approach.
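The four scenarios above differ only in which encoder parameters receive gradients, which can be sketched in PyTorch as follows. Everything named here is illustrative: `TinyUNetLike` is a hypothetical stand-in with the same encoder/decoder split as a U-Net, and the mapping of "initial" to the first double-convolution block and "half" to two of the four down-sampling stages is an assumption, since the text does not give the exact layer boundaries. Loading the drone-trained weights is omitted.

```python
import torch
import torch.nn as nn


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU()
    )


class TinyUNetLike(nn.Module):
    """Hypothetical stand-in: an initial block, four encoder stages, a decoder."""

    def __init__(self):
        super().__init__()
        self.inc = conv_block(3, 8)
        self.down = nn.ModuleList(
            [conv_block(8 * 2**k, 16 * 2**k) for k in range(4)]
        )
        self.decoder = nn.Conv2d(128, 2, kernel_size=1)

    def forward(self, x):
        x = self.inc(x)
        for stage in self.down:
            x = stage(x)
        return self.decoder(x)


# Assumed mapping: number of down-sampling stages frozen in addition to `inc`.
FROZEN_DOWN_STAGES = {"none": None, "initial": 0, "half": 2, "entire": 4}


def freeze_encoder(model: nn.Module, scenario: str) -> nn.Module:
    """Set requires_grad so that only the unfrozen layers are fine-tuned."""
    for p in model.parameters():
        p.requires_grad = True  # start from a fully trainable model
    n = FROZEN_DOWN_STAGES[scenario]
    if n is None:
        return model            # 'Unfreezing the Entire Encoder'
    for module in [model.inc, *model.down[:n]]:
        for p in module.parameters():
            p.requires_grad = False
    return model


model = freeze_encoder(TinyUNetLike(), "initial")
# Only trainable parameters are handed to the optimizer; 3e-4 is the learning
# rate reported for the best setup, and a smaller value may suit fine-tuning.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4
)
```

Note that setting `requires_grad = False` stops weight updates only; batch normalization layers in frozen blocks still update their running statistics in training mode unless those blocks are also switched to `eval()`.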

Model Evaluation
The model's performance and accuracy in detecting and segmenting sinkholes from satellite and drone images were evaluated using multiple performance metrics. Considering the safety risks associated with undetected sinkholes (False Negatives) and the potential costs of monitoring False Positives, we prioritized minimizing false negatives over false positives, i.e., we penalize more heavily the cases in which a 'Sinkhole' pixel is not detected. We computed the following metrics: accuracy, specificity, and per-class precision, recall, and F1 score (i.e., Dice score); refer to Table 2. Below are some brief definitions:
Confusion Matrix: A table used to describe the performance of a classification model by comparing the predicted class for each data instance to its actual class label [62].
True Positives (TP): These are pixels correctly identified as belonging to the target class. For the 'Sinkhole' class, TP is the number of pixels that are predicted as 'Sinkhole' and are also labelled 'Sinkhole' in the ground truth.
True Negatives (TN): In our multi-class segmentation context, TN for a specific class refers to pixels that are correctly identified as not belonging to that class. To calculate it, we assume all pixels not involved in TP, FP, and FN for a class are TNs.
False Positives (FP): These are pixels incorrectly labelled as belonging to the target class. For the 'Sinkhole' class, FP is the number of pixels that do not actually belong to a sinkhole but are predicted as such.
False Negatives (FN): These are pixels that belong to the target class but are not identified as such. For the 'Sinkhole' class, FN is the number of pixels that are truly part of a sinkhole but are missed (i.e., predicted as either 'Background' or 'Edge').
Specificity: A measure of the model's ability to correctly identify true negatives (TN), i.e., to correctly predict the absence of a condition. It is calculated as:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Recall (also known as Sensitivity): Represents the model's ability to correctly identify all actual instances of a specific class. It is the percentage of correctly predicted class pixels out of the total existing pixels of that class. For the 'Sinkhole' class, it is calculated as:

$$\text{Recall}_{\text{sinkhole}} = \frac{TP_{\text{sinkhole}}}{TP_{\text{sinkhole}} + FN_{\text{sinkhole}}}$$

Precision: The percentage of correctly predicted class pixels out of all pixels predicted as the class of interest. For the 'Sinkhole' class, it is calculated as:

$$\text{Precision}_{\text{sinkhole}} = \frac{TP_{\text{sinkhole}}}{TP_{\text{sinkhole}} + FP_{\text{sinkhole}}}$$

F1 Score: Harmonic mean of precision and recall for each class. The F1 score is used to find an equilibrium between the reliability of positive predictions and the model's ability to detect positives. For the 'Sinkhole' class, it is calculated as:

$$\text{F1}_{\text{sinkhole}} = \frac{2 \cdot \text{Precision}_{\text{sinkhole}} \cdot \text{Recall}_{\text{sinkhole}}}{\text{Precision}_{\text{sinkhole}} + \text{Recall}_{\text{sinkhole}}}$$

Accuracy: The proportion of correctly identified pixels for a specific class (both TP and TN) relative to the total number of pixels in the image. It is calculated as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
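The per-class definitions above translate directly into a few lines of NumPy. The sketch below assumes a pixel-level confusion matrix whose rows are ground-truth classes and whose columns are predicted classes; this layout and the function name `per_class_metrics` are illustrative choices, not taken from the authors' code.

```python
import numpy as np


def per_class_metrics(confusion, cls: int) -> dict:
    """One-vs-rest metrics for class `cls` (rows: truth, columns: prediction)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = confusion[cls, cls]
    fp = confusion[:, cls].sum() - tp    # predicted as cls, labelled otherwise
    fn = confusion[cls, :].sum() - tp    # labelled cls, predicted otherwise
    tn = confusion.sum() - tp - fp - fn  # all remaining pixels

    def ratio(num: float, den: float) -> float:
        return num / den if den > 0 else 0.0  # avoid division by zero

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

Because TN is dominated by background pixels, specificity and accuracy stay high even when a minority class is poorly detected, which is why recall, precision, and F1 are the more informative figures for the 'Sinkhole' and 'Edge' classes.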

Experiment Setup
We used PyTorch and PyTorch Lightning as our frameworks of choice for all the experiments and took advantage of the scaled training capabilities they provide over multiple compute nodes and GPUs. Microsoft's Neural Network Intelligence (NNI) tool was utilized to explore the search space for different hyperparameters, such as batch size and learning rate, thus maximizing model performance on the available dataset. Our best experiment setup used a batch size of 64, 1000 epochs of training, the Adam optimizer [17], and a learning rate of 0.0003. In deep learning, an epoch refers to one complete cycle through the entire training dataset; training cycles through several epochs to tune the network weights. The Adam optimizer is an algorithm for optimizing neural networks that combines the advantages of AdaGrad and RMSProp, adjusting learning rates based on recent gradient changes to enhance the efficiency and speed of training. All convolutional layers were set to have a 3 × 3 kernel size, where a kernel is a small matrix used to apply a filter across an input image to extract features such as edges and textures by performing convolution operations. Padding was enabled to ensure that the output feature maps maintain the same spatial dimensions as the input. To increase the training data volume and enhance the network's generalization performance, data augmentation techniques such as rotation and horizontal and vertical flipping of the image patches were employed using the 'albumentations' library.
We made use of three distinct loss functions: non-weighted cross-entropy, weighted cross-entropy, and focal loss. We deliberately chose non-weighted cross-entropy to begin with, because semantic segmentation can be viewed as a per-pixel multi-class classification task in which each pixel is assigned a categorical label. Cross-entropy loss is therefore a natural choice for such a multi-class scenario. The alternative would have been dice loss, but since it suffers from gradient instabilities, we decided in favour of non-weighted cross-entropy, with the added constraint of evaluating on the dice metric instead. Later on, when faced with data imbalance issues, we chose two more functions that put more focus on the minority classes while being natural extensions of the parent loss function, i.e., the non-weighted cross-entropy. The first choice was weighted cross-entropy, which puts a fixed, static attention on the minority pixels. Subsequently, we chose a more dynamic, adaptive attention using the focal loss, to contrast it with the fixed scenario of weighted cross-entropy. Focal loss offers a softer, more gradual attention on the minority pixels throughout the training procedure, as evidenced by Lin et al. [63].
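A key detail of augmenting segmentation data is that the image patch and its mask must receive the same geometric transform. The following NumPy sketch illustrates this for flips and rotations; it is a stand-in for the 'albumentations' pipeline mentioned above rather than a reproduction of it, and the restriction to rotations by multiples of 90° and the function name `augment_pair` are assumptions.

```python
import numpy as np


def augment_pair(image: np.ndarray, mask: np.ndarray, rng: np.random.Generator):
    """Apply one random rotation/flip jointly to an (H, W, C) image and (H, W) mask."""
    k = int(rng.integers(0, 4))                  # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:                       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        image, mask = image[::-1], mask[::-1]
    # Return contiguous copies so downstream tensor conversion is straightforward.
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```

For square patches such as 128 × 128 tiles, these transforms preserve the patch shape, so augmented samples can be batched with the originals.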

• Non-Weighted Cross-Entropy
Calculated as:

$$\mathrm{CE}(p, y) = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right]$$

where y is the ground truth and p the predicted probability for the class with label 1. This loss function measures the disparity between predictions and actual values, treating all classes equally.

• Weighted Cross-Entropy
Changes the standard cross-entropy by introducing weights for classes:

$$\mathrm{WCE}(p, y) = -\left[\, w_1\, y \log p + w_0\, (1 - y) \log (1 - p) \,\right]$$

with w being the weight assigned to each class. This approach gives higher importance to underrepresented classes. We assigned weights inversely proportional to the number of pixels in each class, i.e., $w_c \propto 1/N_c$, where $N_c$ is the number of pixels in class c.

• Focal Loss
Expressed as:

$$\mathrm{FL}(p_t) = -\alpha_t\, (1 - p_t)^{\gamma} \log p_t$$

where $\alpha_t$ and $\gamma$ are hyperparameters, and $p_t$ is the model's estimated probability for the class with label t. Focal loss dynamically focuses on challenging misclassifications during training, adapting to problematic cases in a more responsive manner than weighted cross-entropy. This adaptability may allow for a quicker and more efficient training process, as it prioritizes difficult-to-classify instances on the fly. Note that focal loss is a generalization of the non-weighted cross-entropy loss, which is recovered for $\gamma = 0$ and $\alpha_t = 1$ for all classes.

Progress monitoring involved computing the validation loss and dice score (see Figure 9) after each epoch. The model checkpoint was updated each time the validation dice score improved. We used early stopping to prevent overfitting, i.e., memorizing the training dataset, which is usually correlated with an inability to generalize to other third-party datasets. More concretely, if the validation score did not improve within a tolerance threshold of 10 epochs, we stopped training. The best model was evaluated quantitatively on the test dataset by calculating the test metrics, as well as qualitatively by storing the prediction maps for visual inspection (see Appendix A).
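The relationship between the three loss functions can be made concrete with a small NumPy sketch written in terms of the true-class probability. This is an illustration under assumptions, not the training code: the losses are averaged over pixels (frameworks differ in how they normalize weighted losses), the inverse-frequency weights are normalized to sum to one, and the function names are hypothetical.

```python
import numpy as np


def inverse_frequency_weights(targets: np.ndarray, num_classes: int) -> np.ndarray:
    """Class weights proportional to 1 / (pixel count), normalized to sum to 1."""
    counts = np.bincount(targets.ravel(), minlength=num_classes).astype(float)
    inv = 1.0 / np.maximum(counts, 1.0)  # guard against classes with no pixels
    return inv / inv.sum()


def focal_loss(probs, targets, alpha=None, gamma: float = 0.0, eps: float = 1e-12):
    """Mean focal loss for (N, C) class probabilities and (N,) integer labels.

    gamma = 0 with alpha = None gives non-weighted cross-entropy;
    gamma = 0 with per-class alpha gives weighted cross-entropy.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    n, c = probs.shape
    alpha = np.ones(c) if alpha is None else np.asarray(alpha, dtype=float)
    p_t = probs[np.arange(n), targets]   # probability assigned to the true class
    per_pixel = -alpha[targets] * (1.0 - p_t) ** gamma * np.log(p_t + eps)
    return float(per_pixel.mean())
```

With `gamma > 0`, well-classified pixels (large true-class probability) are down-weighted by the factor `(1 - p_t) ** gamma`, so the average loss is dominated by the harder pixels, whereas the class weights in `alpha` stay fixed throughout training.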

Performance of the Model
The initial results were obtained by training the model using the drone image dataset. Table 2 provides the outcomes of the various experiments performed. These results underscored the importance of choosing and tuning the appropriate loss functions, primarily due to the imbalanced nature of the dataset. To monitor the minority classes, i.e., sinkholes and edges, per-class metrics were reported. Regarding the accuracy metric, we would like to point out that the 99% mark was consistently reached because of the very large number of background pixels.

Performance Analysis across Datasets
In this section, we highlight how the model performed across the high-resolution drone and lower-resolution satellite imagery (see Table 2). This includes looking at the recall for the 'Sinkhole' class, which reflects accurate sinkhole detection (maximizing TP) and lower risk through fewer missed detections (minimizing FN). In addition, we consider the F1 score, which serves as a key metric for balancing precision and recall, indicating the model's overall effectiveness in identifying sinkholes. Given the class imbalance, specificity and accuracy might be less indicative of model performance for sinkhole detection than recall and precision.

Phase I: Trained with Drone Images
The experiments demonstrate notable consistency in achieving high precision and recall for the 'Sinkhole' class across different loss functions, with the precision for 'Sinkhole' remaining above 89% across all experiments. The highest sinkhole recall, achieved using non-weighted CE, stands at 96.79%, alongside an F1 score of 97.08%. The 'Edge' class, on the other hand, shows significantly lower precision and recall across all experiments: its highest recall reached only 17.24%, with both non-weighted CE and focal loss (gamma = 1), and its highest F1 score was 17.761%, achieved with non-weighted CE. Meanwhile, the 'Background' class consistently exhibits high performance, which can be attributed to its majority representation in the dataset (see Table 2A).

Phase II: Trained with Satellite Images
In the second phase of our research, we aimed to maintain the model's performance in identifying the 'Sinkhole' class on lower-resolution satellite images via transfer learning. Our experiments revealed a consistent performance in detecting and delineating sinkholes, with 'Freezing Initial Encoder Layers' being the most effective strategy, achieving a recall of 92.055% and an F1 score of 91.228%. This was closely followed by 'Unfreezing the Entire Encoder', then 'Freezing Half of the Encoder Layers', and finally 'Freezing the Entire Encoder', which was the least effective. In all experiments, the precision and F1 score for the 'Sinkhole' class remained high, above 85.088% and 83.541%, respectively (see Table 2B). It is worth noting that the competitive results of 'Unfreezing the Entire Encoder' indicate that the model is capable of adapting to new data, flexibly altering all the drone experiment weights in a holistic manner while still maintaining a reasonable level of performance.

Discussion
This work demonstrates the capability of our implemented U-Net-based pipeline to accurately detect sinkholes. The system's effectiveness in segmenting separate sinkhole instances with accurate detection of their boundaries was particularly evident when trained with high-resolution drone images. Scholars have highlighted the importance of high-resolution data in enhancing the accuracy of geological hazard detection [64], and our results confirm this perspective, yet we also show that considerable accuracy can be maintained even with lower-resolution images through techniques like transfer learning. Throughout both deployment phases, from high-resolution drone to lower-resolution satellite imagery, the model maintained consistently high precision and F1 scores for the 'Sinkhole' class (Table 2). Such consistency under varied imaging conditions and resolutions is critical in minimizing potential risks associated with inaccurate sinkhole detection [3,65]. Our findings confirm the scalability and capability of the U-Net architecture to effectively detect sinkholes from aerial data, as previously noted by Mihevc and Mihevc [31]. In addition to the quantitative evaluations, we carried out qualitative assessments through visual inspection of prediction maps (see Appendices A and B), because the numbers alone may not convey the full picture.

Challenges in Sinkhole Edge Detection
One of the major challenges we faced was the accurate segmentation of edges between merged sinkholes, primarily due to class imbalance and the less distinct nature of these features compared to the sinkholes themselves. This difficulty aligns with the findings of Kang et al. [30], who noted the challenges in detecting sinkholes within narrowly defined areas and diverse datasets, and also echoes concerns raised by Nefeslioglu et al. [32] about the complexities involved in distinguishing closely spaced geological features due to overlapping characteristics.
Edges are inherently more challenging to detect than other classes, because they represent thin, often ambiguous transitions between distinct sinkholes, which can be difficult for the model to learn and generalize. In addition, this class contains significantly fewer training samples than the 'Sinkhole' and 'Background' classes, an imbalance that affects the model's ability to accurately learn its characteristics. This imbalance is reflected in the lower recall and F1 scores for the 'Edge' class, as the model tends to be biased towards the more prevalent classes.
Another aspect contributing to the low 'Edge' recall and F1 scores is the very high inter-class similarity between sinkhole pixels and edge pixels. An edge pixel can be regarded as a sinkhole pixel that belongs to several sinkholes simultaneously. Given that the sinkhole class can be regarded as subsuming the edge class in some sense, it is understandable that the neural network has difficulties predicting the edge class in particular.
Yet another important point is that we apply the morphological operation of dilation as a pre-processing step to the thin edges derived via customized Sobel edge detection. This is done to mark the transition regions between two merged sinkholes and to emulate the natural uncertainty that even human experts experience when demarcating the exact boundary where one sinkhole ends and the other begins. The dilation, however, also has a confusing effect on the CNN, because some pixels from the sinkhole class are inevitably placed in the edge set. Hence, the CNN will encounter some ambiguity stemming from this occasional data mixing. Nevertheless, we believe in keeping the dilation step to encode uncertainty and letting the network reduce this uncertainty in a data-driven manner, while we scale and accumulate more data over time from new sources. To mitigate the data mixing issue, the dilation operation can be coupled with an additional fuzzy logic block in which pixels are labelled probabilistically, e.g., a certain pixel is 70% sinkhole-like and 30% close to a 'pure' edge. We would like to pursue this direction, as it resembles human intuition, amongst other things. Currently, we are forcing the CNN to draw a hard distinction between two ambiguous classes, when in fact we may benefit from softer decision-making between the two.
Despite these challenges, the model showed a degree of success in edge classification, laying initial groundwork for further improvement in future studies. On the other hand, the 'Background' class demonstrated high performance, facilitated by its majority representation in the dataset. The ease of classifying 'Background' pixels also contributes to the model's overall effectiveness in distinguishing salient sinkhole features from their surroundings, which is critical for generating accurate sinkhole maps.
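The edge-labelling step discussed in this section, and the data mixing that dilation introduces, can be illustrated with a small NumPy sketch. It deliberately swaps in a simpler technique than the one used in the study: instead of the customized Sobel filter, it marks a pixel as an edge when one of its eight neighbours belongs to a different sinkhole. It also assumes that the input is an instance map in which touching sinkholes carry distinct integer IDs, that the dilated band is kept inside the sinkholes, and that the class codes are 0, 1, and 2; all names are hypothetical.

```python
import numpy as np

BACKGROUND, SINKHOLE, EDGE = 0, 1, 2  # assumed class codes


def neighbours(a: np.ndarray):
    """Yield the eight one-pixel shifts of a 2D array (zero padded)."""
    padded = np.pad(a, 1)
    h, w = a.shape
    for dy in range(3):
        for dx in range(3):
            if (dy, dx) != (1, 1):
                yield padded[dy:dy + h, dx:dx + w]


def three_class_mask(instances: np.ndarray, dilation_iters: int = 1) -> np.ndarray:
    """Build a Background/Sinkhole/Edge mask from an instance-ID map."""
    inside = instances > 0
    edge = np.zeros_like(inside)
    # Thin edge: sinkhole pixels touching a pixel of a different sinkhole.
    for nb in neighbours(instances):
        edge |= inside & (nb > 0) & (nb != instances)
    # Dilation with a 3x3 structuring element widens the transition band.
    for _ in range(dilation_iters):
        grown = edge.copy()
        for nb in neighbours(edge):
            grown |= nb
        edge = grown & inside  # assumption: the band stays inside sinkholes
    mask = np.where(inside, SINKHOLE, BACKGROUND).astype(np.uint8)
    mask[edge] = EDGE
    return mask
```

Each dilation iteration relabels one more ring of 'Sinkhole' pixels as 'Edge', which makes the mixing between the two classes described above directly visible in the output mask.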

Handling Class Imbalance
Class imbalance in our dataset posed a significant challenge, affecting the model's ability to learn from less represented classes such as 'Sinkhole' and 'Edge'. The literature confirms that the effectiveness of machine learning models in environmental applications is heavily reliant on the balance and representation of classes within the training data [31,64]. Recognizing this, we adopted strategies such as weighted cross-entropy and focal loss. Both of these loss functions direct the attention of the model towards the minority classes in the initial cycles of training, either in a fixed, static manner in the case of weighted cross-entropy, or more adaptively in the case of focal loss, as presented by Lin et al. [63]. However, applying these strategies in practice proved to be more challenging than expected and, surprisingly, non-weighted cross-entropy provided better results for the 'Sinkhole' and 'Edge' minority classes with the least amount of effort (Table 2). So far we have taken a principled approach: in the case of weighted cross-entropy, for example, we made the class weights inversely proportional to the number of class pixels available. Technically, however, these class weights can be searched more empirically, i.e., by brute force within a broader search space. The search space can likewise be extended for the gamma parameter of the focal loss. We therefore assume that by expanding the search space for the class weights of the weighted cross-entropy loss and for the gamma hyperparameter of the focal loss, we might reach better local minima for our model. This, of course, comes at the cost of considerably more training resources and GPU time. For efficiency reasons, we would like to expand the search space for these loss function hyperparameters once we have gathered more data for the edge class, either from new sources or from super-resolution techniques. We assume that applying the above-mentioned loss functions is indeed promising, provided that a data quantity threshold is reached, i.e., a critical mass of samples is available. In our case, the sinkhole pixels are very underrepresented with respect to the background, and the edge pixels are extremely underrepresented. Hence, we would like to collect more edge samples in particular in the future.
Considering this, as well as the points mentioned in Section 5.1, dealing with class imbalance is a complex matter. It requires not only expanding the search space for the hyperparameters of our alternative loss functions, but also broadening the data pool for the minority classes, curating these samples better to minimize data mixing, and employing fuzzy logic to encode uncertainty and enable soft decision-making rather than hard demarcation of naturally ambiguous classes.

Effectiveness of Transfer Learning
Our research highlights the model's adaptability across different resolutions and imaging conditions through strategic application of transfer learning [53]. This adaptability is important for practical applications, ensuring that the model can be deployed in various real-world scenarios with different data quality and resolutions [65]. A key to our success was the strategic freezing and unfreezing of specific layers within the U-Net architecture, which played an important role in achieving high precision and recall for the 'Sinkhole' class. The 'Freezing Initial Encoder Layers' approach was especially beneficial. It capitalized on the fundamental features learned from the high-resolution drone imagery, effectively transferring this knowledge to interpret the lower-resolution satellite images. Karpatne et al. [65] and Ma and Mei [64] further reinforce the importance of transfer learning for wider applicability across fields and aerial data distributions.

Model Generalisability to Other Karst Environments
The adaptability of our model to different geological settings and multi-resolution scenarios broadens its applicability and utility for geohazard management. However, computer vision models for landform mapping can produce unexpected predictions when applied to a geographical area different from the one where they were first trained [33]. This so-called 'out-of-distribution' phenomenon is one of the greatest challenges for machine learning mapping and requires considerable attention. This is especially true in karst environments, whose landscape configuration is highly variable. Unique environments can develop within very small areas, and their characteristics depend upon many factors, including the lithology and structure of the host rocks, the present and past climates which have prevailed in a given karst area, and the surface and subsurface hydrological conditions [1,66,67].
The karst environments on the shores of the Dead Sea have formed by dissolution and physical erosion of subsurface evaporite deposits, which are interlayered with poorly consolidated alluvial and lacustrine sediments [34,35,37,40]. As the climate is very arid, there is very little surface water or vegetation in the study area, meaning that there is effectively no epikarst layer. Dissolution is therefore almost absent as a surface process: collapse into subsurface voids is the primary mechanism of sinkhole formation, along with surface sagging across broader areas, with wide areas of subsidence and coalescence of sinkholes forming larger depressions [8,36]. The resulting landscape is one in which optical imagery allows clear delineation of sinkholes by the human eye (Figure 2C,D; Figure 6), particularly with respect to the open sinks which form at the margins of stream-channel meanders (Figure 5A). Although sinkholes have different morphologies, and thus different visual characteristics, when formed in the alluvial fan deposits as compared to the lacustrine mud deposits (cf. Figure 10, [35], and Figure 5, [36]), both can be accurately delineated by our model from optical imagery alone. However, this may not necessarily be the case in solutional karst environments, where shadowing and colour gradients between sinkholes and the background image are far less pronounced. In such environments, a hill-shaded elevation model is likely to be more suitable as input data for classifying sinkholes [31]. Furthermore, the general absence of vegetation at Ghor Al-Haditha also lends itself to sinkhole detection from optical imagery.
There is considerable scope for applying our model to other karst environments, though further training and validation would be required to ensure accurate transferability. Fine-tuning the model would have to be carried out on additional datasets that capture the variety of sinkhole morphologies occurring in different geological and climatic settings, along with different vegetation covers and optical characteristics. For example, in a forested karst landscape, our approach would likely require significant adaptation, as vegetation would obscure the true land surface. For this case, it might be possible to incorporate LiDAR and multispectral data, which can be corrected to remove vegetation [68]. In urban environments, occlusion of sinkholes would present additional challenges, as the visual appearance and morphology of sinkholes will differ from natural cavities due to the influence of anthropogenic structures such as buildings and vehicles [69]. As the number of recognised sinkhole occurrences in urban areas has increased substantially in recent decades [70], adaptation of our model to urban landscapes may be especially important. Adaptive learning methods can be used to allow the model to dynamically adjust to new data distributions and enhance its performance in different environments. Techniques such as domain adaptation and domain generalization can help the model learn invariant features that are relevant across various settings [71].

Conclusions
Our research, focusing on the identification and mapping of sinkholes in the evaporite karst at Ghor Al-Haditha on the eastern shore of the Dead Sea, demonstrated the effective use of a system designed for geological structure recognition and centred around the U-Net architecture. The research was carried out in two phases. Initially, the model was trained, validated, and tested using high-resolution drone-based orthophoto images (0.1 m GSD) captured in December 2016 and covering 250 different sinkholes (see Figure 5F). In the second phase, the model was fine-tuned and tested on a larger dataset with lower resolution from a Pleiades Neo satellite image (0.3 m GSD) covering 1038 different sinkholes.
The methodology highlights strategic layer freezing and unfreezing during the training process, which supports the model's adaptability to different image resolutions. Our dual-phase approach has consistently returned high recall and F1 scores for the 'Sinkhole' class under various imaging conditions. Notably, the highest recall in Phase I was achieved using non-weighted CE, at 96.79%, alongside an F1 score of 97.08%. In Phase II, the 'Freezing Initial Encoder Layers' strategy achieved a recall of 92.06% and an F1 score of 91.23%, showing robustness and effectiveness across input scales.
Furthermore, the deliberate use of RGB-only visual bands in aerial data, previously considered not useful by some authors [33], proved to be promising in our methodology. This broadens the model's applicability and enhances scalability due to more readily available data inputs.
The model attempts to address the technical challenge of class imbalance via more sophisticated loss functions, such as weighted cross-entropy and focal loss. However, further fine-tuning of the class weights and gamma is necessary for these loss functions to improve the results beyond the non-weighted cross-entropy baseline. This should be carried out with a larger and better curated dataset, for the edge class in particular. Additionally, given that we applied dilation as a pre-processing step to encode transition region uncertainty within merged or coalesced sinkholes, we intend to pursue fuzzy logic as a means towards soft decision-making for 'sinkhole vs. edge' classification, to accommodate the high inter-class similarity of these minority classes. Since the sinkhole sample set can be regarded as subsuming the edge sample set, this would be a natural and promising next step.

Panel labels (images omitted): The Satellite Images from the Year 2022; Associated Ground Truth Semantic Segmentation Mask Image.
Table A3. Semantic and Instance Segmentation Results (Phase II). Note: As the images are very large, we provide a sample from a selected area (the same area covered by the drone dataset) to show the results obtained in the different experiments. Panel labels (images omitted): Semantic Segmentation; Instance Segmentation; Freezing Initial Encoder Layers; Freezing Half of the Encoder.

Figure 1. Schematic representations of different 'top-down' algorithmic methods of delineating sinkholes. (A) Method of O'Callaghan and Mark [21], which maps the depressions according to simulated stratification of water within them. Adapted with permission from [21], 2024, Elsevier. (B) The 'D8' method of Jenson and Domingue [22], which uses a moving window to map the watersheds within the depression. This method and that shown in (A) become very computationally intensive with high-resolution data. (C) The 'priority fill' method of Wang and Liu [23], which is able to simulate filling of the entire compound depression in one pass of processing. This method offers an improvement in run-time of a factor of 30 on (B), but is not able to capture the internal complexity of the compound depression. Adapted with permission from [23], 2024, Taylor & Francis. (D) The 'contour tree' method developed by Wu et al. [24,25], which builds on the 'priority fill' method to produce a graph ('tree') of contours within the compound depression, allowing nested depressions to be identified and labelled by their rank. This allows for more accurate automated updating of depression location and morphometric databases. The method has since

Figure 2. Overview of the study area. (A) ESRI satellite imagery of the Dead Sea. The location of part (B) is marked. (B) Pleiades 1-A satellite image from April 2018 of the Ghor Al-Haditha study area on the Dead Sea's eastern shore. The outline of data collected in the December 2016 drone survey, the extent of sinkhole formation across the study area, and the position of the Dead Sea shoreline in 1967 are shown. Additionally, the areas covered by the datasets used for Phase I (red) and Phase II (grey) of our study are shown, as are the locations of parts (C) and (D), which depict sinkholes in both alluvium and mud materials as they appear in the 2016 structure-from-motion orthophoto and in Pleiades Neo satellite imagery from August 2022, respectively. Several new sinkholes have formed in 2022 as compared to 2016, and others have changed in shape and size.

Figure 3. Relevant computer vision problems within which we could frame our task. We ultimately chose image segmentation. (A) Image classification: an entire image is classified according to a label. (B) Object detection: instances of objects of a certain class are detected within an image. (C) Semantic segmentation: each pixel of an image is labelled with a corresponding class, i.e., per-pixel classification. (D) Instance segmentation: each pixel of an image is labelled with a corresponding class, and instances of objects of each class are detected within the image.

Figure 4. The different layers that were used to guide the annotation process for the training dataset in Phase I. The sinkhole cluster shown here is the same as that highlighted in Figure 5C-F. (A) RGB orthophoto mosaic. (B) DSM data visualized as a hill-shaded relief map. Contour lines generated from the DSM data with an interval of 1 m were also used. (C) Elevation profile generated within ArcGIS Pro (V.2.9) along the axis of a sinkhole cluster. The tool was used in special cases to find the exact edges, especially the edges between compound (merged) sinkholes, as presented in the image.

Figure 5. Generating the sinkhole instance segmentation mask image. (A) The selected area from the drone image for training sample generation. (B) Several depicted sinkholes. Note the 3 compound sinkhole instances. (C) Using different layers to guide the annotation process. (D) Different polygons were manually drawn for each sinkhole instance with precise edges. (E) Polygons converted to a raster layer in which each sinkhole is shown in a different colour. (F) TIFF mask image with all the sinkholes in the selected area.

Figure 6. (A) Defining features of sinkhole outlines. (B) Mapping of sinkholes in the area.

Figure 7. (A) Drone RGB image for the research area, (B) Sinkhole instance segmentation label image as created for the drone image case, and (C) The derived 3-class label image.

Figure 8. Overview of the sinkhole instance segmentation pipeline used in Phase I of the study. This diagram illustrates the multi-stage process used to train a multi-class U-Net model, adapted from Ronneberger et al. [54]. The workflow begins with pre-processing the mask image (STEP 1) to detect edges between sinkholes, transforming the original two-class mask (Background and Sinkhole) into a three-class mask (Background, Sinkhole, and Edge Class). The input RGB orthophoto and the generated three-class mask are then used to train the multi-class U-Net model (STEP 2). The best-trained model is then applied to segment the full orthophoto, generating a semantically segmented mask (STEP 3). This mask undergoes a post-processing step (STEP 4) to generate the final instance segmentation mask image.
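As an illustration of STEP 1 in Figure 8, the sketch below derives a three-class mask from an instance-label grid by marking as 'edge' every sinkhole pixel that touches a different sinkhole instance. This is only one plausible reading of the edge-detection step: the function name `three_class_mask`, the 4-neighbourhood rule, and the toy grid are assumptions, and the dilation applied in the study is not reproduced here.

```python
def three_class_mask(instances):
    """Convert an instance-label grid to a 3-class mask.

    Input:  0 = background, k > 0 = id of sinkhole instance k.
    Output: 0 = background, 1 = sinkhole, 2 = edge (a sinkhole pixel
            that is 4-adjacent to a pixel of a different instance).
    """
    height, width = len(instances), len(instances[0])
    mask = [[1 if instances[y][x] else 0 for x in range(width)]
            for y in range(height)]
    for y in range(height):
        for x in range(width):
            here = instances[y][x]
            if not here:
                continue  # background pixels are never edges
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    other = instances[ny][nx]
                    if other and other != here:
                        mask[y][x] = 2
                        break
    return mask


# Toy example: two touching (compound) sinkholes with ids 1 and 2.
labels = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
for row in three_class_mask(labels):
    print(row)
```

Only the contact zone between the two instances is relabelled, so isolated sinkholes keep a plain two-class representation while merged ones gain a separating edge band.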

Figure 9. Model performance for Phase I as judged by the average Dice score.

Table 1. Overview of relevant studies which have applied Machine Learning (ML) and Deep Learning (DL) to detect sinkholes from remote sensing data.

Table 2. Performance metrics for the developed models. (A

Table A1. Semantic and Instance Segmentation Results (Phase I).

Table A2. Satellite Image and Ground Truth Mask.