Article

DASeg: A Domain-Adaptive Segmentation Pipeline Using Vision Foundation Models—Earthquake Damage Detection Use Case

1 School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
2 School of Computing Instruction, Georgia Institute of Technology, Atlanta, GA 30332, USA
3 School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2812; https://doi.org/10.3390/rs17162812
Submission received: 4 June 2025 / Revised: 4 August 2025 / Accepted: 11 August 2025 / Published: 14 August 2025

Abstract

Limited labeled imagery and tight response windows hinder accurate damage quantification for post-disaster assessment. The objective of this study is to develop and evaluate a deep learning-based Domain-Adaptive Segmentation (DASeg) workflow to detect post-disaster damage using limited information available shortly after an event. DASeg unifies three Vision Foundation Models in an automatic workflow: a fine-tuned DINOv2 supplies attention-based point prompts, a fine-tuned Grounding DINO yields open-set box prompts, and a frozen Segment Anything Model (SAM) generates the final masks. In the earthquake-focused case study DASeg-Quake, the pipeline boosts mean Intersection over Union (mIoU) by 9.52% over prior work and 2.10% over state-of-the-art supervised baselines. In a zero-shot setting, DASeg-Quake achieves an mIoU of 75.03% for geo-damage analysis, closely matching expert-level annotations. These results show that DASeg delivers superior infrastructure damage segmentation without needing pixel-level annotation, providing a practical solution for early-stage disaster response.

1. Introduction

In disaster assessments, timely and accurate damage quantification is critical to facilitate efficient evaluation and subsequent repair, particularly in regions with large populations [1]. Any delay in damage assessment can critically hinder search-and-rescue in collapsed structures and prolong efforts to locate survivors trapped under rubble [2]. However, conventional damage assessment workflows often encounter a “cold-start problem” in the early post-disaster stage, where the initial phase of analysis is hindered by obstacles such as limited access to high-quality imagery (challenge A), unpredictable weather conditions (challenge B), inherent complexities of model training (challenge C), and the need for time-intensive expert annotation (challenge D).
Social networks contribute to alleviating the “cold-start” constraints by delivering ground-level post-disaster images within minutes of an event. Unlike satellite or drone imagery [3], social media platforms provide a large volume of low-cost, crowd-sourced images shortly after a disaster (Solution A). These user-generated images are generally unaffected by adverse weather conditions such as cloud cover, and offer wide geographic coverage through platforms like X (formerly Twitter) and Facebook (Solution B) [4]. In addition, state-of-the-art vision models are commonly pre-trained on similar ground-level scenes [5,6,7,8], making social media imagery well suited for downstream fine-tuning and inference (Solution C). However, challenge D remains unresolved: generating accurate pixel-wise segmentation labels for post-disaster images still requires intensive expert effort, which continues to bottleneck the deployment of a segmentation-based damage assessment workflow.
The recent development of Large Foundation Models (LFMs) offers a promising avenue to address the annotation bottleneck of challenge D. LFMs have achieved remarkable performance in a wide range of computer vision tasks [9,10,11]. Models such as DINO [12,13], SAM [14], and CLIP [15] learn generalizable visual representations from massive datasets and have achieved impressive results in well-understood fields [16]. These models are particularly attractive for disaster-response applications, where annotated data is limited and rapid deployment is essential. However, applying LFMs to post-disaster damage assessment workflows, particularly for the semantic segmentation of fine-grained and irregular damage patterns, remains a challenge. Collapsed buildings, scattered debris, and other damage-specific features are poorly represented in standard datasets such as ImageNet [17], COCO [18], and ADE20K [19], limiting model generalization in these scenarios. Consequently, LFMs like SAM and CLIP frequently underperform in domain-shifted settings such as satellite imagery, medical diagnostics, and disaster imagery [11,20,21].
To develop an enhanced workflow to detect disaster infrastructure damage when only limited post-event data are available, we propose DASeg, a Domain-Adaptive Segmentation pipeline tailored for early-stage post-disaster image analysis. The proposed model transforms an interactive prompt-based segmentation approach into a complete end-to-end semantic segmentation pipeline, enabling the semantic segmentation of infrastructure damage without requiring time-consuming pixel-level annotation. We select post-earthquake damage analysis as a use case to evaluate our pipeline’s effectiveness, which is referred to as DASeg-Quake henceforth.
Instead of relying on manually provided point and box prompts, DASeg generates them automatically using two pre-trained vision foundation models: DINOv2 [13] and Grounding DINO [22]. For point prompts, DINOv2 combined with Transformer Input Sampling (TIS) [23] identifies high-attention regions and extracts “attention points” that are likely related to damaged areas. For box prompts, Grounding DINO [22] performs open-set detection: given text queries such as “damaged structure” or “debris,” it returns bounding boxes for the corresponding objects, allowing flexible adaptation to various damage scenarios. These automatically generated prompts are then passed to SAM to produce segmentation masks of damaged regions, eliminating the need for any pixel-level annotations.
The evaluation of DASeg-Quake highlights its performance comparable to state-of-the-art supervised learning models such as BEiT [8], Mask2Former [7], and DeepLabv3+ [5]. Furthermore, the zero-shot DASeg-Quake indicates strong domain adaptability, achieving near-expert pixel-level accuracy and reasonable boundary delineation in earthquake-induced geological damage prediction. Together, these findings suggest that DASeg-Quake offers a reliable end-to-end workflow to identify disaster-affected areas using only limited information available shortly after an event.

2. Related Work

2.1. Limitations of Aerial-Based Disaster Damage Assessment

Previous disaster damage detection studies frequently used macro-level imagery, including satellite sources such as USGS Landsat [24], ESA Sentinel via Copernicus [25], and Maxar Open Data [26], as well as aerial data supported by reconnaissance teams such as GEER (Geotechnical Extreme Events Reconnaissance) [27] and NOAA (National Oceanic and Atmospheric Administration) [28]. These data are typically high-resolution and provide detailed information about post-disaster conditions.
Despite their value, the process of collecting and analyzing these data in real-world post-disaster environments presents substantial challenges. Data acquisition often relies on expensive technologies such as Global Positioning System (GPS) and Unmanned Aerial Vehicles (UAVs) [29], both of which are susceptible to disruptions caused by adverse weather [30]. In particular, persistent cloud cover can severely hinder the acquisition of clear aerial images [3]. Furthermore, the subsequent analysis process is typically time-consuming. Although certain types of geospatial data, such as ShakeMaps, peak ground acceleration records, surface displacement measurements, and landslide or liquefaction susceptibility maps, can be produced without interference from cloud cover, they lack the granularity needed for accurate infrastructure-level damage assessments.
A case in point is the 6 February 2023, Türkiye Earthquake. Within three days, the GEER team released a preliminary virtual report summarizing the initial geotechnical findings [31]. However, more detailed structural damage assessments were based primarily on images obtained from the social media platform X. GEER’s in-person UAV-based field reconnaissance did not begin until early March, allowing for more accurate evaluations, but arrived too late to inform rapid response strategies [32].
Although deep learning methods have been increasingly applied to support damage detection, many existing models remain limited to binary classification, differentiating only between damaged and undamaged structures. For example, in response to the 2023 Türkiye Earthquake, the Microsoft AI for Good and Microsoft Philanthropies team fine-tuned the Convolutional Neural Networks (CNNs) to assess infrastructure damage using satellite imagery. However, these simplified outputs still often require expert post-processing to derive actionable insights [33].

2.2. Disaster Damage Assessment Studies on Social Media Images

The majority of studies in disaster damage assessment focus on classifying the damage levels of infrastructure [4,34,35,36,37]. Most datasets inherit their labeling strategy from the Damage Assessment Dataset (DAD) and classify structural damage into three categories: Little-to-no, Mild, and Severe. In addition to predicting disaster damage, researchers have proposed various multitask classification datasets for social media images [34,35,36,37]. Alam et al. proposed MEDIC, a multitask classification dataset comprising 71,198 images. The dataset includes four tasks: disaster types (earthquake, fire, flood, hurricane, landslide, no disaster, and other disasters), informativeness (informative and not informative), humanitarian needs with four class labels (Affected/Injured/Dead people, Infrastructure and utility damage, Rescue volunteering or donation effort, Not humanitarian), and damage severity (Little-to-no, Mild, Severe). The damage severity labels in MEDIC are derived primarily from DAD. However, researchers have identified issues in the DAD labeling descriptions, noting their subjectivity and the presence of duplicate images, which pose challenges for subsequent analyses [38]. To address these limitations and establish a high-quality dataset with a systematic labeling framework, Huang et al. introduced the Earthquake Infrastructure Damage (EID) dataset [38]. Labeled by expert annotators, EID has achieved state-of-the-art performance in earthquake disaster analysis and serves as the training dataset for generating point prompts in this study.
Beyond mere classification, researchers have focused on segmenting disaster-damaged areas within social media images [39,40,41]. Alam et al. [42] provided visual explanations using Grad-CAM, based on their benchmark dataset and models such as VGG16 and EfficientNet. Notably, even when the model classifies accurately, the resulting heatmaps often require further refinement before they can be interpreted. Similarly, Li et al. [39] computed the average of the 14 × 14 CAM map on VGG16 and introduced a continuous value termed DAV. This metric aims to quantify the degree of damage, offering an alternative way to harness this valuable information. However, the applicability of this method remains limited, as only ten images were evaluated for semantic segmentation. Shekarizadeh et al. proposed an unsupervised model named Deep-Disaster that leverages Knowledge Distillation (KD) to identify damaged zones [41]. These earlier works lack detailed evaluation of segmentation quality due to the absence of a labeled ground-truth dataset. Zhang et al. introduced the Damage Semantic Segmentation (DSS) dataset [43], the first semantic segmentation dataset dedicated to earthquake disaster damage analysis. This paper leverages DSS to comprehensively evaluate both previous methods and DASeg-Quake.

2.3. Semantic Segmentation of SAM

SAM performs prompted segmentation without incorporating label information from the images. Users have the option to generate masks with SAM in an unlabeled form or to provide prompts to guide the segmentation process interactively. To provide meaningful guidance for SAM, the model has been integrated with other methods such as Grounding DINO [22], ChatGPT [44,45], Stable Diffusion [46], and CLIP [47]. The basic idea behind these integrations is that the models complement each other’s strengths to attach semantic information to SAM masks. SAM2 further generalizes this paradigm to time-varying data, offering unified prompted segmentation for both images and videos [48].
In remote sensing analysis, the Segment Anything Model (SAM) has gained attention as a valuable tool for facilitating image segmentation. Most related research focuses on drone and satellite images. The Python 3.10 package SAMGeo offers robust functionalities for segmenting geospatial data using SAM [49]. With minimal human input, such as a simple bounding box or a single-point annotation, SAM can generate reasonable segmentation results for applications such as land cover mapping, urban expansion monitoring, and land use change detection. However, SAM faces several challenges when applied to remote sensing data: (1) the complex nature of object features and their surrounding environments [9], (2) the limited availability of expert annotations [16], and (3) the absence of remote sensing-specific data in the pre-training corpora of commonly used foundation models [38].
To overcome these challenges, researchers have proposed various modifications to improve SAM performance in remote sensing. Chen et al. introduced RSPrompter, a prompt-learning approach tailored to SAM to generate appropriate prompts for remote sensing images [50]. Yan et al. developed RingMo-SAM, which integrates a specialized prompt encoder to improve the segmentation accuracy of SAM for both optical and SAR remote sensing data [51]. Pu et al. [52] introduced the Classwise-SAM-Adapter, an adaptation of SAM designed for the classification of land cover in spaceborne SAR imagery. Furthermore, Zheng et al. proposed MC-SAM SEG, a multi-cognitive SAM-based instance segmentation model specifically optimized for remote sensing applications [53]. However, all of these strategies still rely on large, domain-specific datasets and require fine-tuning of SAM or its adapters, a combination that limits their practicality in data-scarce or time-critical scenarios such as rapid disaster response.

2.4. Explainability in Computer Vision

In the context discussed in Section 2.2, a noticeable gap exists in the provision of segmentation datasets tailored to the analysis of disaster damage derived from social media images. Implementing the class attention map for Weakly-supervised semantic segmentation (WSSS) [54,55,56] is an intriguing alternative. Although both image- and pixel-level labels present acquisition challenges, the former is comparatively easier to obtain.
Class activation mapping (CAM) is among the most widely adopted techniques for interpreting the predictions of convolutional neural networks (CNNs). CAM is commonly derived through a three-step process. First, each feature map is multiplied by its corresponding weight. The resulting weighted feature maps are then aggregated through summation. Finally, the Rectified Linear Unit (ReLU) function is applied to eliminate negative activations. The strategy for obtaining these weights can vary across different attention methods. For example, CAM [57] employs a strategy that acquires weights by transforming the initial fully connected layer (FC) into a global average pooling layer. Grad-CAM [58] and Grad-CAM++ [59] determine weights by computing the gradients of the target relative to each feature map, followed by averaging each feature map. LayerCAM [55], on the other hand, integrates spatial data from intermediate convolutional layers and uses pixel-wise score distributions. This approach yields detailed, high-resolution class activation maps, enhancing model interpretability.
Explainability methods for Vision Transformers (ViTs) can be broadly categorized into attention-based, gradient-based, and perturbation-based approaches. Attention-based techniques, such as the Attention Rollout proposed by Abnar et al. [60], create saliency maps by aggregating attention heads via operations including averaging, minimization, or maximization. This technique also incorporates an identity matrix to account for residual connections, which are essential for monitoring information flow across layers. Among gradient-based approaches, Partial Layer-wise Relevance Propagation (Partial LRP) [61] computes the importance of each attention head by leveraging layer-wise relevance propagation principles. Extending this method, Chefer et al. [62,63] generated class-specific explanations by integrating LRP with gradient information through a rule-based procedure that selectively skips or adds layer information. Unlike these methods, perturbation-based approaches do not rely on attention weights or gradients but instead utilize masks derived from patch embeddings to assess relevance. ViT-CX [64] introduces a mask-based technique that clusters embeddings to reduce the number of masks, ultimately generating the saliency map through a bias-corrected summation of the pixel coverage. Building on ViT-CX, Englebert et al. proposed Transformer Input Sampling (TIS), which constructs saliency maps by analyzing perturbations induced by sampling input tokens.

3. Materials and Methods

The proposed DASeg pipeline is illustrated in Figure 1, which comprises three primary components: point prompt generation, box prompt generation, and SAM-based segmentation. Specifically, for point prompt generation, we first fine-tune DINOv2 using a classification dataset and then derive point prompts from high-saliency regions identified by the TIS visual interpretation method. For box prompt generation, we fine-tune Grounding DINO on an object detection dataset to produce bounding box prompts. Finally, SAM generates masks from the point and box prompts separately, and the results are merged using a union operation to form the predicted damage segmentation.
Figure 2 shows the DASeg pipeline in the context of post-disaster damage assessment, which we term DASeg-Quake. In this configuration, the TIS process [23] interprets the fine-tuned DINOv2 on the EID dataset [38], producing saliency maps that highlight regions containing visually significant information. Sampling points from these regions direct DASeg-Quake to focus on areas most relevant for identifying damaged structures. Simultaneously, Grounding DINO generates box prompts after being fine-tuned on an object detection dataset adapted from DSS [43], guiding SAM to prioritize entire damaged regions rather than isolated debris or fragmented structural components. The final damage mask is obtained by merging the SAM outputs from both point and box prompts. In this study, we present a detailed exposition of the DASeg pipeline through its application in DASeg-Quake for post-earthquake damage analysis.
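For clarity, the overall flow described above can be summarized in a short sketch. This is a minimal illustration only: the three wrapper functions are hypothetical placeholders for the fine-tuned DINOv2 + TIS stage, the fine-tuned Grounding DINO stage, and the frozen SAM stage, and do not correspond to a released implementation.

```python
# Minimal sketch of the DASeg orchestration described above. The wrapper
# functions (generate_point_prompts, generate_box_prompts, segment_with_sam)
# are hypothetical placeholders for the fine-tuned DINOv2 + TIS, fine-tuned
# Grounding DINO, and frozen SAM stages, respectively.
import numpy as np

def daseg_predict(image: np.ndarray,
                  generate_point_prompts,
                  generate_box_prompts,
                  segment_with_sam) -> np.ndarray:
    """Return a binary damage mask for one post-disaster image."""
    # 1. Point prompts from high-saliency regions of the fine-tuned DINOv2 (via TIS).
    points = generate_point_prompts(image)                  # (A_k, 2) pixel coordinates
    # 2. Box prompts from Grounding DINO with damage-related text queries.
    boxes = generate_box_prompts(image, text="damaged structure . debris")
    # 3. Frozen SAM turns each prompt type into a mask; the masks are merged by union.
    mask_from_points = segment_with_sam(image, points=points)
    mask_from_boxes = segment_with_sam(image, boxes=boxes)
    return np.logical_or(mask_from_points, mask_from_boxes)
```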

3.1. Data Preparation

To evaluate DASeg-Quake, we selected two post-earthquake social media image datasets: the EID dataset [38] and the DSS dataset [43]. Unlike traditional post-earthquake datasets collected via satellites or drones, social media images lack standardized resolution due to their multi-platform origins. However, they offer timely, ground-level building details that are often unavailable from aerial sources. Details about dataset collection are discussed in the following sections. Table 1 summarizes the characteristics of the EID and DSS datasets used in this study.

3.1.1. EID Dataset

The EID dataset presents a novel four-class earthquake infrastructure damage assessment problem, meticulously compiled from images sourced from various social media databases with a focus on data quality [38]. Unlike previous datasets such as DAD [4], EID establishes detailed annotation guidelines based on recognized damage scales. It comprises 13,513 high-quality images from five significant earthquakes: the Nepal earthquake (2015), the Illapel earthquake (2015), the Ecuador earthquake (2016), the Mexico earthquake (2017), and the Iran–Iraq earthquake (2017). The dataset is categorized into four classes: Irrelevant or non-informative, No damage, Mild damage, and Severe damage.
For the annotation process, each image in the dataset was labeled by three individuals with backgrounds in either civil engineering or computer science, following a structured annotation guideline. All three annotators labeled the same image, and the final class label was determined through majority voting. If no consensus could be reached among the three annotators, the image was excluded from the dataset [38]. To our knowledge, this is the first post-disaster social media image dataset to offer a detailed and transparent annotation protocol, providing a valuable resource for future research.

3.1.2. DSS Dataset

Zhang et al. [43] introduced DSS, the first semantic segmentation social media dataset dedicated to earthquake disaster damage analysis. This dataset comprises 607 images collected from five significant earthquakes: the Wenchuan earthquake (2008), the Haiti earthquake (2010), the Nepal earthquake (2015), the Turkey earthquake (2023), and the Morocco earthquake (2023). In particular, the Morocco earthquake data, comprising 60 images, serves as the test dataset. Expert annotators conducted pixel-level annotations, establishing a three-class categorization: Undamaged Structure, Damaged Structure, and Debris.
For data annotation, two domain experts labeled the images in DSS with a focus on structural building damage. Since the goal of DASeg-Quake is aligned with previous studies [39,40,41] in identifying damaged areas, we reformulated the original segmentation labels as follows:
1.
We relabeled entire buildings as Damaged Structure if any part of the building was originally labeled as Damaged Structure, shifting the focus from damaged subregions to the identification of the whole building.
2.
We converted the original three-class labels into a binary classification task by merging the Damaged Structure and Debris categories into a single Damaged class, while treating all other pixels as Undamaged (a minimal relabeling sketch follows this list).
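The binary relabeling in step 2 can be sketched as below, assuming the DSS masks are stored as integer class maps; the specific class ids used here are illustrative assumptions, not the dataset’s actual encoding.

```python
# Sketch of the binary relabeling in step 2, assuming DSS masks are stored as
# integer class maps with the (hypothetical) ids: 0 = background / undamaged
# structure, 1 = damaged structure, 2 = debris.
import numpy as np

DAMAGED_IDS = (1, 2)  # Damaged Structure and Debris are merged into one class

def to_binary_damage_mask(class_map: np.ndarray) -> np.ndarray:
    """Map a DSS-style class map to {0: Undamaged, 1: Damaged}."""
    return np.isin(class_map, DAMAGED_IDS).astype(np.uint8)
```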

3.1.3. Bounding Box Annotation

The DSS dataset includes three classes: Debris, Damaged Structures, and Undamaged Structures. Under expert supervision, semantic segmentation labels for Debris and Damaged Structures were manually converted into bounding box labels. Despite the availability of pixel-level ground truth annotations, accurately drawing bounding boxes that capture structural damage areas is inherently challenging. Firstly, the irregular shapes of damaged structures and debris, such as collapsed buildings and scattered debris fields, lead to inconsistencies in bounding box alignment among annotators. An example illustrating these labeling challenges is presented in Figure 3. The complexity arises primarily from irregular debris distributions, including scenarios in which debris that surrounds structures is widely scattered across the background or forms conical accumulations. Secondly, it is challenging to minimize the inclusion of irrelevant background areas. When attempting to cover extensive debris fields with a single bounding box, annotators risk incorporating significant amounts of extraneous foreground content. Although the VOC2011 guidelines [65] recommend limiting extraneous pixels to less than 5%, applying this criterion proves difficult due to the irregular and complex shapes of the damaged regions.
Subjectivity is indeed a well-known challenge in annotation tasks, particularly in domain-specific applications such as disaster damage assessment. This issue is not unique to our study; it also affects other well-established datasets such as xBD [3], DSS [43], and EID [38]. For example, both the DSS and EID datasets involved multiple annotators labeling the same images, allowing us to assess and mitigate inter-annotator variability. Moreover, the field previously lacked standard guidelines for drawing bounding boxes around irregular structures such as debris. To address this, we proposed a detailed evaluation of different annotation strategies and quantified their effects on segmentation performance.
To improve annotation reliability, two annotators with backgrounds in computer science and civil engineering manually drew bounding boxes of the damage area under expert guidance. Let the damaged area be enclosed by a bounding box $B_d$, where the background (non-damaged) area within $B_d$ is denoted $A_{\mathrm{bg}}$ and the damaged area $A_{\mathrm{dmg}}$. The maximum allowable proportion of irrelevant pixels during labeling is given by
$$R = \frac{A_{\mathrm{bg}}}{A_{\mathrm{dmg}} + A_{\mathrm{bg}}}$$
If the proportion of irrelevant pixels exceeds $R$, an additional bounding box should be drawn to better capture the damaged area.
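The ratio $R$ can also be checked programmatically during annotation. The sketch below assumes a binary damage mask and a box in (x_min, y_min, x_max, y_max) pixel coordinates; the function name and the example ceiling are illustrative.

```python
# Worked check of the irrelevant-pixel ratio R for a candidate bounding box,
# assuming a binary damage mask (1 = damaged) and integer pixel coordinates.
import numpy as np

def irrelevant_pixel_ratio(damage_mask: np.ndarray, box) -> float:
    x_min, y_min, x_max, y_max = box
    crop = damage_mask[y_min:y_max, x_min:x_max]
    a_dmg = int(crop.sum())            # damaged pixels inside the box
    a_bg = crop.size - a_dmg           # background pixels inside the box
    return a_bg / (a_dmg + a_bg)

# If R exceeds the chosen ceiling (e.g., 0.35 for the 25-35% window discussed
# below), the guideline suggests splitting the region into an additional box.
```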
Figure 4 illustrates examples of various labeling strategies utilized by annotators. Two annotators independently labeled the DSS dataset employing different bounding box approaches. Each strategy is characterized by the proportion of irrelevant pixels ($R$), i.e., the background area enclosed by the bounding box that is unrelated to the damaged area. Beyond the strict PASCAL-VOC 2011 guideline ($R \le 5\%$), we evaluated four wider windows: 25–35%, 45–55%, 75–85%, and 95–100%.
Table 2 summarizes the segmentation performance of DASeg-Quake when prompted with the bounding boxes of each strategy. The results reveal that the strategy with an irrelevant pixel proportion of 25–35% performs best, and all strategies achieved performance above 83%. Specifically, Table 2 shows that the mIoU score varies minimally between annotators under the same labeling strategy, confirming that the impact of annotation subjectivity on the final model performance is controlled in practice. Note that an intermediate range such as 5–25% yields marginally higher accuracy; however, requiring annotators to judge such fine differences with the naked eye would introduce impractical subjectivity.
Consequently, based on these findings, a refined Annotation Guideline derived from VOC2011 is recommended as follows:
  • Preferably, each object (e.g., building or debris) should be enclosed by a single bounding box, with background pixels comprising no more than approximately 25% to 35% of the total bounding box area.
  • Limit the labeling of each damaged structure to a maximum of two bounding boxes, with each box clearly containing a distinct, identifiable part of the structure. For debris objects with less defined boundaries, using more than two bounding boxes is permissible.
  • Bounding boxes should be delineated based on objects as they visibly appear in the original image, rather than solely aligning with segmented mask labels.

3.2. Point Prompt Generation from Saliency Map

The DASeg-Quake model selects “attention points” $V_p$ from the “damaged” regions of the saliency map to use as point prompts. DINOv2 has shown strong out-of-distribution performance and produces features applicable at image- and pixel-level resolutions [13]. We fine-tuned the DINOv2 model, based on ViT [66], using the EID dataset (details in Section 3.1). To obtain the saliency map of the model, we implement TIS [23].
For an input image $I$ of size $518 \times 518$, the embedding computation module produces an output represented as $\mathrm{embedding}(I) = T$, where $T \in \mathbb{R}^{N_t \times D}$ is the token sequence, $N_t$ is the number of tokens, and $D$ is the feature dimension. The embeddings of each transformer layer are concatenated into a matrix $C \in \mathbb{R}^{N_t \times (L \cdot D)}$, where $L$ represents the number of layers in the ViT encoders. We then reduce the dimension of $C$ to a smaller matrix $K \in \mathbb{R}^{N_t \times N_m}$ using K-means clustering [67], where $N_m$ is the number of masks. According to the original paper [23], $N_m$ is set to 1024. A binary mask is then generated to retain the important tokens used for the class score calculation:
$$M_{ij} = \begin{cases} 1 & \text{if } K_{ij} \in \mathrm{topk}(K_{\cdot j}, N_k), \\ 0 & \text{otherwise}. \end{cases}$$
Here, $N_k$ represents the number of tokens to sample, set to 50% of $N_t$ as suggested in the original paper [23]. The function $\mathrm{topk}(K_{\cdot j}, N_k)$ selects the set of the $N_k$ largest elements in $K_{\cdot j}$. The mask weights are generated by retrieving outputs for the “severe damage” (sd) class of EID:
$$w_{j,sd} = \mathrm{transformer}_{sd}(F_j), \quad 1 \le j \le N_m,$$
where $\mathrm{transformer}_{sd}(\cdot)$ refers to the transformer encoder with the task-specific head in the ViT. $F_j$ is associated with the mask $M_{\cdot j}$, where $F_j = \{ T_i \mid M_{ij} = 1 \}$ and $F_j \in \mathbb{R}^{N_t \times D}$. The final weighted mask is computed as
$$\mathrm{TIS}_d = \left( \sum_{j=1}^{N_m} w_{j,sd}\, M_{\cdot j} \right) \oslash \left( \sum_{j=1}^{N_m} M_{\cdot j} \right),$$
where $\oslash$ denotes element-wise division. The resulting damage map $\mathrm{TIS}_d$ has values ranging from 0 to 1. To generate attention points, we randomly sample $A_k$ pixels from the regions of $\mathrm{TIS}_d$ where the pixel value exceeds 0.8, and these points are used as point prompts $V_p$ to guide SAM. Randomly sampling pixels with values above 0.8 ensures that only high-confidence, diverse points guide SAM, improving segmentation precision and reducing computational load.
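A minimal sketch of this sampling step is given below, assuming the TIS damage map is available as a 2-D array with values in [0, 1]; the function and variable names are illustrative.

```python
# Sketch of point-prompt sampling from a TIS damage map. The 0.8 threshold and
# the number of points mirror the settings reported in Sections 3.2 and 5.1.
import numpy as np

def sample_point_prompts(tis_map: np.ndarray,
                         threshold: float = 0.8,
                         num_points: int = 12,
                         seed: int = 0) -> np.ndarray:
    """Randomly sample (x, y) point prompts from high-saliency pixels."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(tis_map > threshold)        # high-attention pixels
    if len(xs) == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(len(xs), size=min(num_points, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)     # SAM expects (x, y) order
```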

3.3. Box Prompt Generation from Grounding DINO

We generate the box prompt $V_b$ using the Grounding DINO model fine-tuned on the updated DSS dataset described in Section 3.1.3. Grounding DINO utilizes a dual-encoder-single-decoder architecture to process image and text prompts as input. The class names “Damaged Structure” and “Debris” in the DSS dataset are converted into the text prompt input for the Grounding DINO model.
Given an input image $I$ and the text prompts $T_k$, we extract the features of $I$ and $T_k$ using the image encoder $E_I$ and text encoder $E_T$, respectively. Afterward, the two sets of features are fed into a feature enhancer module for cross-modality feature fusion. Specifically, the feature enhancer applies deformable self-attention to the image features $X_I \in \mathbb{R}^{N_I \times D}$ and self-attention to the text features $X_T \in \mathbb{R}^{N_T \times D}$, where $N_I$ is the number of image tokens and $N_T$ is the number of text tokens. The language-guided query selection module selects the features most relevant to the input text as decoder queries, where $N_q$ denotes the number of queries extracted from the encoder’s image features. Finally, the image and text features, along with the cross-modality queries, are fed into a cross-modality decoder, which generates the predicted bounding boxes that serve as the box prompt $V_b$ for SAM.
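The box-prompt step can be summarized as follows. The detector is treated as an opaque callable standing in for the fine-tuned Grounding DINO model, and the caption format and confidence cutoff are illustrative assumptions rather than the exact settings used in this work.

```python
# Sketch of box-prompt generation with a fine-tuned open-set detector, treated
# here as a callable `detector(image, caption)` that returns NumPy arrays of
# boxes and scores; caption text and the 0.35 cutoff are illustrative.
import numpy as np

TEXT_PROMPT = "damaged structure . debris"   # DSS class names joined as text queries

def generate_box_prompts(image: np.ndarray, detector,
                         score_threshold: float = 0.35) -> np.ndarray:
    boxes, scores = detector(image, caption=TEXT_PROMPT)
    keep = scores >= score_threshold
    return boxes[keep]                        # (N, 4) boxes in (x_min, y_min, x_max, y_max)
```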

3.4. Mask Fusion

The SAM architecture, which we utilize as a frozen pre-trained model, comprises three primary components: an image encoder, a flexible prompt encoder, and a fast mask decoder. We first generate the predicted masks separately based on the point prompts $V_p$ and the box prompts $V_b$, and then combine these predicted masks through a union operation. By taking their union, we ensure that any part identified by either prompt is included in the final mask, improving robustness and minimizing the risk of missing parts of the object of interest. Specifically, let $M_p$ denote the predicted mask from the point prompts and $M_b$ the predicted mask from the box prompts; the final mask $M$ is obtained as
$$M = M_p \cup M_b$$
The SAM model mitigates prediction ambiguity by generating three distinct mask outputs, each with a corresponding prediction confidence score. These confidence scores are quantitative indicators of the estimated IoU of the detected objects. Our methodology employs a deterministic selection criterion, consistently utilizing the predicted mask with the maximum confidence score, as this approach optimally captures the comprehensive features characteristic of the damaged area.
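The fusion step can be sketched with the publicly released segment-anything package as follows. The checkpoint path is hypothetical, and a single box prompt is used for simplicity; multiple boxes would be handled by looping or batched prediction.

```python
# Sketch of the mask-fusion step using a frozen SAM checkpoint from the
# segment-anything package. The checkpoint path is a placeholder.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # hypothetical path
predictor = SamPredictor(sam)

def fuse_masks(image: np.ndarray, point_prompts: np.ndarray, box_prompt: np.ndarray) -> np.ndarray:
    predictor.set_image(image)
    # Point-driven prediction: SAM returns three candidate masks with confidence scores.
    m_p, s_p, _ = predictor.predict(point_coords=point_prompts,
                                    point_labels=np.ones(len(point_prompts)),
                                    multimask_output=True)
    # Box-driven prediction for one bounding box in (x_min, y_min, x_max, y_max) format.
    m_b, s_b, _ = predictor.predict(box=box_prompt, multimask_output=True)
    # Keep the highest-confidence mask from each branch, then take the union.
    return np.logical_or(m_p[int(np.argmax(s_p))], m_b[int(np.argmax(s_b))])
```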

3.5. Evaluation Metrics

The mIoU and Pixel Accuracy (PA) metrics are employed to evaluate the performance of DASeg-Quake as well as previous studies [39,40,41]. The model is assessed on the DSS test set using a binary semantic segmentation approach, classifying regions as either Damaged or Undamaged.
The mIoU score quantifies the overlap between predicted segmentation masks and ground truth masks, averaged over all classes. It is defined as
$$\mathrm{mIoU} = \frac{1}{2} \sum_{i=1}^{2} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}$$
where $P_i$ and $G_i$ represent the predicted and ground truth sets for class $i$ (Damaged or Undamaged). A higher mIoU indicates better segmentation performance.
The PA measures the proportion of correctly classified pixels across the entire image:
$$\text{Pixel Accuracy} = \frac{\text{Number of correctly classified pixels}}{\text{Total number of pixels}}$$
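For reference, both metrics can be computed for the binary (Damaged vs. Undamaged) setting as in the sketch below, assuming boolean prediction and ground-truth arrays.

```python
# Reference implementation of the two metrics for the binary setting, where
# `pred` and `gt` are boolean arrays in which True marks damaged pixels.
import numpy as np

def binary_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    ious = []
    for cls in (True, False):                    # Damaged, Undamaged
        p, g = (pred == cls), (gt == cls)
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    return float((pred == gt).mean())
```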

4. Experimental Methodology

We trained DINOv2 [13] with a ViT-B [66] backbone on the EID dataset. For box prompts, we utilize a Swin-T [68] backbone to fine-tune the Grounding DINO [22] model, striking a balance between performance and efficiency to support the practical application of DASeg-Quake. All experiments were conducted with a Tesla V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 40 GB memory.
To train the DINOv2 model, we set the learning rate to $3 \times 10^{-5}$ and momentum to 0.9, and trained for 30 epochs with a batch size of 32. We employed the ReduceLROnPlateau scheduler for dynamic learning rate adjustment and used Stochastic Gradient Descent (SGD) with the Cross-Entropy loss [69]. Data augmentation was performed using regular CenterCrop and RandomFlip. For Grounding DINO, in addition to the default configuration recommended by the MMDetection toolbox [70], we set the learning rate to $1 \times 10^{-4}$ and trained for 30 epochs. For both models, early stopping was applied if no improvement was observed within 10 consecutive epochs. Fine-tuning the DASeg-Quake model required approximately 2.7 h for DINOv2 and 0.78 h (≈47 min) for Grounding DINO, totaling around 3.48 h.
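The DINOv2 fine-tuning configuration above can be expressed with standard PyTorch components as in the following sketch; the scheduler patience and the exact augmentation transforms (e.g., RandomHorizontalFlip standing in for “RandomFlip”) are assumptions, and `model` is a placeholder for the DINOv2 classifier with its task head.

```python
# Hedged sketch of the DINOv2 fine-tuning setup described above, using standard
# PyTorch components; `model` stands in for the DINOv2 classifier, which is not
# constructed here. Scheduler patience and transform details are assumptions.
import torch
from torchvision import transforms

def build_training_setup(model: torch.nn.Module):
    optimizer = torch.optim.SGD(model.parameters(), lr=3e-5, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
    criterion = torch.nn.CrossEntropyLoss()
    augment = transforms.Compose([
        transforms.CenterCrop(518),              # DINOv2 input resolution used here
        transforms.RandomHorizontalFlip(),       # assumed flavor of "RandomFlip"
        transforms.ToTensor(),
    ])
    return optimizer, scheduler, criterion, augment

# Training runs for up to 30 epochs (batch size 32) with early stopping after
# 10 epochs without improvement; Grounding DINO is fine-tuned separately via
# MMDetection with lr = 1e-4.
```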

5. Results and Discussion

5.1. Point Prompts Analysis

5.1.1. Number of Point Prompts

We compare the performance of DASeg-Quake with different prompting strategies: point prompts only and bounding box only. As shown in Figure 5, the results indicate that the combined DASeg-Quake pipeline using both prompt formats achieves the highest mIoU score compared to single-type prompt implementations. Moreover, the mIoU increases as the number of point prompts increases, stabilizing after approximately 10 points. Once enough points (around 10) have been added, they effectively represent all critical areas of the detected building and debris, allowing the model to create a well-defined mask. Beyond this threshold, additional points contribute little new information and may instead introduce noise, resulting in negligible improvements or even a slight degradation in segmentation quality. This plateau in performance suggests that most of the damaged areas have already been successfully identified. Based on this observation, we set the number of point prompts to 12 for subsequent experiments. Notably, DASeg-Quake demonstrates remarkable stability across different prompt settings in damage feature extraction, thereby minimizing the complexity of hyperparameter optimization.

5.1.2. Threshold for Attention Map

We evaluated the impact of different pixel threshold values used to identify high-attention regions when sampling the point prompts $V_p$ from the saliency map. Following the experimental setup in Section 5.1.1, we set the number of sampled point prompts to 12. As shown in Table 3, setting the threshold value to 0.8 yields the best performance for DASeg-Quake. Setting the threshold too high restricts high-attention areas to small regions within the image, potentially overlooking important features of the object. In contrast, setting a low threshold, such as 0.5, includes excessive background and introduces noise. Therefore, a threshold value in the range of 0.6 to 0.8 is considered ideal.

5.1.3. Backbone Model Comparison

When fine-tuning a classification model based on DINOv2, we acknowledge that our dataset is relatively small, making it impractical to fine-tune the entire model. We conduct experiments fine-tuning various layer blocks and model backbones to identify the optimal configuration that maximizes classification performance and enhances the saliency map quality. Using the experimental settings described in Section 4, we evaluate the classification performance of DINOv2 with ViT-S, ViT-B, and ViT-L backbones, fine-tuning various numbers of layer blocks (1, 3, 5, and all). Table 4 summarizes the results for disaster damage prediction in the EID dataset, showing that the DINOv2-B backbone achieves the best performance when only three blocks are fine-tuned. These results demonstrate that fine-tuning a limited number of blocks can yield impressive outcomes, even with a small dataset.
The model with optimal performance is then used to generate saliency maps, as illustrated in Figure 6. We include the saliency maps of the ViT-B model and the ground truth as a reference. Although the saliency maps generated by DINOv2-S successfully capture most damaged areas, they exhibit confusion in heavily damaged regions, often delineating irrelevant object outlines, as shown in the last three columns of the visualization in Figure 6. Furthermore, we observe minimal differences between the saliency maps produced by the DINOv2-B and DINOv2-L models. In particular, DINOv2-B achieves better contrast between damaged and undamaged regions, indicating that employing a larger vision transformer yields limited benefit for this task.

5.2. BBox Annotation Analysis

Despite the presence of annotation guidelines, the process of bounding box labeling remains subjective, resulting in variability in the annotations. Figure 3 illustrates an example of different labels of “Debris” applied to the same image by different annotators in this study.
To validate the robustness of DASeg-Quake against annotation variability, we conducted experimental evaluations using independent bounding box annotations from two annotators (Annotators A and B). As shown in Table 5, we find consistent model performance across different annotator-specific bounding box annotations. The “Final BBox” configuration uses labels refined by expert review to serve as the basis for subsequent experiments. This stable performance across different annotation approaches highlights the model’s robustness and minimizes the need for extensive manual refinement, making it well-suited for practical applications.

5.3. Comparison with Previous Related Work

We compare the predicted mask of DASeg-Quake with previous related approaches. As highlighted in Section 2.2, previous studies provide limited evaluation of segmentation performance using mIoU and PA metrics. To address this gap, we reproduce their results and re-evaluate the semantic segmentation performance on the DSS test dataset.
We adhere closely to the training procedures, datasets, and hyperparameter settings outlined in the original papers [39,40,41] to ensure a fair comparison. The models are pre-trained on DAD. To accommodate different earthquake scenarios, we combine the Ecuador and Nepal earthquake data from DAD. Furthermore, prior studies merge the Mild damage and Severe damage classes into a single Damaged class and retain the Little-to-no damage class as the Undamaged class for training.
We compare DASeg-Quake with the following methods:
  • Li et al. [39,40]: Utilizing GradCAM [39] and GradCAM++ [40] for visual explanations. Since the original DAV model includes further evaluation of its damage localization results, we refer to these methods as Li 1 [39] and Li 2 [40] in this experiment.
  • Deep-Disaster [41]: Employing Vanilla Gradient (VG), Smooth Gradient (SG), and Guided Back-Propagation (GBP) techniques based on a model trained with Knowledge Distillation (KD) methods.
Figure 7 compares the performance of DASeg-Quake against previous studies; DASeg-Quake outperforms previous methods by a substantial margin. Specifically, DASeg-Quake achieves an mIoU of 82.59%, which is 9.52% higher than the best-performing prior method, Li 2. In terms of PA, DASeg-Quake achieves 86.15%, surpassing the previous best by 16.3%.
These comparisons indicate that DASeg-Quake offers a substantial improvement in disaster damage segmentation. The significant gains in the mIoU and PA metrics underscore the effectiveness of our approach and its potential impact on other real-world applications.

5.4. Comparison with Supervised Learning

We compare DASeg-Quake with existing supervised approaches [5,6,7,8,71] for semantic segmentation, fine-tuned on the DSS dataset. To ensure a fair comparison, we select smaller pre-trained models that correspond to the ViT-B and Swin-T architectures utilized in the DINOv2 and Grounding DINO variants of DASeg-Quake. To maintain consistency with the original papers [6,7,8], AdamW [72] is utilized with a learning rate of $6 \times 10^{-5}$ and trained for 50 epochs. The mIoU score is used to evaluate the performance of the model.
Figure 8 indicates that DASeg-Quake outperforms comparable supervised segmentation frameworks, achieving a 2.1% mIoU improvement over the best-performing BEiT-Base model. Although Mask2Former models achieve slightly higher pixel accuracy (PA), the strong mIoU score of DASeg-Quake demonstrates its superior ability to locate damaged regions. The reported FLOPs for DASeg-Quake represent the cumulative computational cost of the entire workflow, including fine-tuning DINOv2 (117.22 GMac), Grounding DINO (232.00 GMac), and prompt-based segmentation using SAM-H (2974.00 GMac). The major computational cost in our pipeline comes from the frozen SAM model. However, since SAM decouples image and prompt processing, it offers greater flexibility for damage localization in various related tasks, making this cost acceptable [14].
Figure 9 presents a comparison of the results of the semantic segmentation. The method proposed by Li et al. [40] localizes only the general areas of damage without delineating specific objects. Furthermore, while the supervised learning methods Mask2Former [7] and BEiT [8] are able to outline most damaged areas, they sometimes misclassify undamaged buildings or do not detect some debris regions. The results demonstrate that the prompts generated by DASeg-Quake guide SAM to distinguish more detailed outlines of the damaged areas.

5.5. Visualization of DASeg

Figure 10 illustrates examples of damaged masks M b , M p , and the combined damaged area M. As observed, using either bounding box prompts or point prompts alone fails to capture some damaged regions. Although bounding boxes can outline damaged areas, SAM struggles to segment debris regions due to the high density of sharp edges and broken outlines often exhibited by debris. In contrast, point prompts effectively capture these high-density features but may overlook certain damaged structures. By combining M b and M p , we obtain a more comprehensive interpretation of the damaged area.

5.6. DASeg for Earthquake-Induced Geo-Damage

To evaluate the domain-adaptive capability of DASeg-Quake, we select earthquake-induced geo-damage as a related but distinct zero-shot scenario. We created a dedicated benchmark consisting of 80 post-earthquake images collected through targeted Google keyword searches. These images were evenly categorized into four types of ground deformation: lateral spreading, surface rupture, landslide, and sinkhole.
Two trained annotators (A and B) independently delineated regions of geo-damage. Note that the damage at the building level was intentionally excluded to isolate terrain-specific characteristics. The masks of Annotator B serve as expert ground truth, while Annotator A represents a human baseline for the agreement between annotators. The zero-shot DASeg-Quake model generates segmentation masks directly, i.e., without any additional fine-tuning on geo-damage imagery.
We present DASeg-Quake’s results together with the inter-annotator agreement between Annotators A and B in order to assess the model’s transferability to the related domain. Table 6 shows the performance of DASeg-Quake in geo-damage analysis. The inter-annotator figures establish an upper bound of 83.93% mIoU and 87.23% PA on this dataset. Without any geo-specific adaptation, DASeg-Quake attains 75.03% mIoU, approximately 90% of human agreement, and even slightly exceeds the human baseline in general PA (88.35% vs 87.23%).
Performance of geo-damage area detection varies across classes. The sinkhole class shows the highest mIoU (87%) due to its compact, high-contrast structure, which aligns well with the edge-oriented prompts leveraged by the SAM backbone. The landslide class also yields strong results in PA (92%), aided by the distinct visual separation between the sliding mass and the surrounding terrain. In contrast, the lateral spreading and rupture classes, which are characterized by narrow fissures and low-texture features, are more challenging to segment. DASeg-Quake underperforms in these categories, with mIoU scores approximately 12% lower than the agreement between the annotators. Nonetheless, the results indicate the strong domain adaptability of DASeg-Quake, achieving near-expert pixel-level accuracy and reasonable boundary delineation despite never having seen geo-damage imagery during training.

5.7. Error Analysis and Limitations

We examined the failure cases shared by DASeg-Quake and two semantic segmentation baselines, Mask2Former and BEiT, following the same setting in Figure 9. Figure 11 groups the most representative errors into two categories:
1.
False positives on unrelated objects: In Rows 1–2, DASeg-Quake mistakenly labels low-contrast foreground instances such as person and car as damaged regions. The underlying issue is the frozen Segment Anything (SAM) backbone: SAM’s embedding quality deteriorates on low-contrast features, leading to over-segmentation of irrelevant objects [14].
2.
Missed damage along high-contrast boundaries: Rows 3–4 illustrate the opposite problem: debris and buildings that meet at a sharp edge are only partially covered. SAM treats this edge as an object boundary and truncates the mask, omitting the adjoining damaged pixels. In the DASeg-Quake result for Row 3, a distinct gap separates the debris region from the damaged building segment on the right.
Mask2Former and BEiT, fine-tuned from Cityscapes checkpoints that already distinguish the vehicle classes (car, bicycle, bus, truck, train, motorcycle, caravan, trailer) and the human classes (person, rider), better suppress such false positives. However, these models still fail to capture certain damaged regions, such as the debris field in Row 3 and the broken building facade in Row 4. Although DASeg-Quake occasionally misses small connecting regions between damage zones (e.g., the gap between debris and building in Row 3), it tends to produce more conservative masks. In contrast, completely omitting a damaged area, as seen in the baseline models, poses a more serious issue for post-disaster assessment.
DASeg-Quake’s performance could be improved in several complementary ways. First, additional object prompts (e.g., person, car) already available in the Grounding DINO vocabulary could be supplied (see Section 5.8). Second, the newly released SAM2 [48] could be adopted, as its enhanced edge sensitivity is expected to alleviate the boundary-truncation issue. Third, more high-quality data could be provided for fine-tuning the model. These insights underscore the importance of addressing both segmentation-level ambiguity and foundation model limitations for robust disaster damage assessment.

5.8. Motivation for Fine-Tuning in Abnormal Pattern Segmentation

In this paper, we fine-tune our model using the EID and DSS datasets. Although zero-shot or unsupervised methods, such as KD, for feature extraction offer advantages in low-resource scenarios, we argue that providing an adaptive, high-performance pipeline is crucial for detecting abnormal patterns. For example, when analyzing images of disaster damage or malignant tumors, the models aim to achieve results that closely align with expert labeling [73], a capability that unsupervised learning methods may lack due to insufficient information. Moreover, slight fine-tuning allows us to retain the benefits of the foundation model while integrating expert guidance. For example, since Grounding DINO is an open-set object detection model, the performance of the DASeg-Quake model is easily enhanced by supplying well-trained prompts such as “people”, “sky”, “car”, etc. Through fine-tuning, we leverage the foundation model’s advanced capabilities for general object detection while embedding expert-driven knowledge to improve the identification of damaged areas.

5.9. Robustness of Hyperparameter Selection

A common concern in prompt-based segmentation pipelines is the potential sensitivity to hyperparameter selection, particularly the threshold used to identify high-saliency regions from attention maps. In DASeg-Quake, we systematically examined this aspect in Section 5.1 and Section 5.2. As shown in Table 3, the model exhibits stable performance across a range of threshold values, from 0.5 to 0.9. While the optimal result is obtained at the default threshold of 0.8 (mIoU = 82.6%), the lowest-performing setting (threshold = 0.5) still achieves an mIoU of 80.8%, representing a 7.73% improvement over the best prior method (Li et al. [40], 73.07%).
In addition, as discussed in Section 3.1.3, we evaluated the effect of annotation variability by testing different bounding box strategies and comparing results from two independent annotators. As shown in Table 2 and Section 3.1.3, segmentation performance (mIoU) remains consistently high, with minimal variation between annotators across all strategies.
These findings demonstrate that DASeg-Quake is not overly sensitive to hyperparameter choices such as the attention threshold, and it is also robust against annotation subjectivity in bounding box generation. This robustness reduces the need for extensive hyperparameter tuning and enhances the practicality of our approach for deployment in real-world, time-sensitive disaster response scenarios.

5.10. Annotation Efficiency and Computational Trade-Offs in DASeg-Quake

We observed that reducing the labeling task from dense pixel-level segmentation to a combination of bounding boxes and image-level classification significantly reduces the annotation burden in disaster damage analysis. On average, expert annotation of the DSS segmentation dataset took approximately 180 s per image [43]. In contrast, for the classification labels in the EID dataset, annotation averaged just 5 s per image after training non-expert annotators, completing 13,513 images in three weeks [38]. For bounding box annotations in DASeg, the average time was approximately 45 s per image, allowing the entire dataset to be labeled in under three weeks. By relying on simple classification and bounding box labels to guide segmentation, DASeg not only achieves higher performance than conventional supervised segmentation methods but also significantly reduces the average annotation time for disaster analysis. These results demonstrate that DASeg offers a scalable and time-efficient labeling strategy, which is highly practical for real-world disaster response scenarios.
In terms of computational efficiency, DASeg-Quake entails a higher training cost than conventional supervised baselines. This is primarily due to the frozen SAM model, whose high FLOPs are dominated by the ViT-H image encoder. However, since SAM decouples image encoding from prompt-driven mask generation, the encoder runs only once per image. DASeg-Quake enables the efficient reuse of image features across multiple prompts without incurring additional costs. Although standard ViT-based segmentation models also perform a single forward pass per image, they lack DASeg’s ability to generate diverse masks conditioned on flexible prompts. Therefore, the additional training cost is offset by the significant reduction in annotation time and the improved segmentation performance, as detailed in Section 5.3 and Section 5.4.

5.11. Potential Application for DASeg

The experimental results of DASeg-Quake demonstrate that DASeg can be effectively implemented in various scenarios with limited data and complex features, including other natural hazards such as hurricanes, tornadoes, and landslides. While the causes of infrastructure damage differ significantly among these scenarios [38], DASeg can adjust its training process to address specific characteristics. For example, for hurricanes, the keyword prompts used in Grounding DINO to obtain box prompts could shift to “water” and “damaged structure”, as building debris may have been displaced or submerged by flooding.
Beyond applications in natural hazard analysis, DASeg has the potential to address challenges in other data-limited domains, such as medical imaging [73], remote sensing [3], and underwater imagery [74]. For example, in medical imaging, the detection of malignant tumors is particularly difficult due to indistinct boundaries caused by metastasis into adjacent tissues [73]. Given the demonstrated robustness of DASeg-Quake in capturing objects with irregular shapes, we believe that DASeg can effectively address such challenges with a few fine-tuning steps.

6. Future Work

Future work could focus on improving the explainability of disaster damage analysis. Rather than limiting outputs to binary damage maps, refining and diversifying the text prompts could enable the detection of a broader range of post-disaster objects under varying environmental conditions. Beyond pixel-level segmentation using large vision models like SAM [14], incorporating descriptive text generated by large language models (LLMs) or vision-language models (VLMs), such as GPT-4V [75] or BLIP-2 [76], could provide richer, more interpretable explanations of the scene. Given the strong performance demonstrated by large foundation models (LFMs) in our study, we believe that integrating multimodal reasoning represents a promising direction to improve model transparency and trustworthiness further.
The DASeg framework is designed to evolve with advancements in foundation models, ensuring its compatibility with future developments. For instance, in the box prompt generation module, we employ the open-set object detector Grounding DINO. This model can be updated or replaced with other open-set vision-language models (VLMs) such as Owl-ViT [77], OV-DETR [47], ViLD [78], GLIP [79,80], and MDETR [81]. While fine-tuning these VLMs offers potential for improved performance, it requires substantial data samples and significant computational resources. We anticipate that more efficient VLMs will emerge in the future, addressing these challenges.
Similarly, for point prompt generation, the pipeline can incorporate other interpretability models such as ViT-CX [82], Layer-wise Relevance Propagation (LRP) [83], and Transition Attention Maps (TAM) [84]. Note that previous studies have shown that perturbation-based techniques such as TIS and ViT-CX generally produce more robust and interpretable results, albeit at a higher computational cost. Therefore, integrating gradient-based methods, such as LRP and TAM, as alternative visualization technologies may help reduce the computation. Alpha-CLIP [85] is another adaptable solution that has demonstrated strong performance on a variety of tasks, including large language models (LLMs), diffusion models, NeRF, and SAM. Due to its broad generalization capabilities, incorporating Alpha-CLIP into DASeg can enhance the framework’s flexibility and extend its applicability to other disaster analysis scenarios beyond earthquake damage. These models offer additional flexibility and adaptability, thereby enhancing the generality of the DASeg framework. By maintaining compatibility with diverse and evolving foundation models, DASeg underscores its versatility and long-term applicability in various domains.

7. Conclusions

We propose a robust and adaptable framework named DASeg, based on SAM, that enables damage segmentation and feature extraction from post-disaster imagery without requiring pixel-level annotations. In this study, we selected the analysis of earthquake damage in social media images as a use case to demonstrate the effectiveness of our pipeline, referred to as DASeg-Quake. Unlike methods that require interactive box and point prompts, DASeg-Quake offers an automatic pipeline to localize damage. Furthermore, the hyperparameters are straightforward to select, reducing the need for extensive tuning.
By incorporating both bounding box and point prompts into the workflow, DASeg-Quake surpasses conventional supervised methods and markedly improves post-earthquake damage segmentation during the early stages of disaster response. Moreover, the zero-shot DASeg-Quake shows strong domain adaptability in earthquake-induced geo-damage analysis. We hope that this scheme can serve as a novel approach to a broader range of real-world semantic segmentation challenges.
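To make the final step of the workflow concrete, the sketch below shows how a frozen SAM can consume automatically generated box and point prompts and return a damage mask. It is a minimal sketch assuming the publicly released segment_anything package; the checkpoint file, image path, and prompt coordinates are placeholders, not values from our experiments.

```python
# Hedged sketch: frozen SAM segmentation driven by box and point prompts.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("post_earthquake_scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

box = np.array([120, 80, 540, 400])           # (xmin, ymin, xmax, ymax) from the detector
points = np.array([[300, 220], [410, 310]])   # attention-based point prompts
labels = np.ones(len(points), dtype=int)      # 1 = foreground point

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=labels,
    box=box,
    multimask_output=False,
)
damage_mask = masks[0]                         # boolean HxW damage mask
```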

Author Contributions

Conceptualization, H.H., M.M.R. and J.D.F.; Methodology, H.H., A.Z. and D.Z.; Validation, H.H.; Formal analysis, H.H.; Data curation, H.H., A.Z. and D.Z.; Writing—original draft, H.H.; Writing—review & editing, H.H., A.Z., D.Z., M.M.R. and J.D.F.; Visualization, H.H.; Supervision, M.M.R. and J.D.F.; Project administration, M.M.R. and J.D.F.; Funding acquisition, M.M.R. and J.D.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Elizabeth and Bill Higginbotham Professorship at Georgia Tech and by School of Computing Instruction funding at Georgia Tech.

Data Availability Statement

The EID dataset is archived in DesignSafe (https://doi.org/10.17603/ds2-yj8p-hs62) and described in detail at https://doi.org/10.1177/87552930251335649. The DSS dataset is available at https://arxiv.org/abs/2507.02781. For access to additional raw data, please contact the corresponding author.

Acknowledgments

The research described herein was supported in part by the Elizabeth and Bill Higginbotham Professorship at Georgia Tech and the School of Computing Instruction at Georgia Tech. This support is gratefully acknowledged.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Earle, P.S.; Wald, D.J.; Jaiswal, K.S.; Allen, T.I.; Hearne, M.G.; Marano, K.D.; Hotovec, A.J.; Fee, J.M. Prompt Assessment of Global Earthquakes for Response (PAGER): A System for Rapidly Determining the Impact of Earthquakes Worldwide; Open-File Report 2009–1131; U.S. Geological Survey: Reston, VA, USA, 2009. Available online: https://pubs.usgs.gov/of/2009/1131/ (accessed on 20 May 2025).
  2. El-Tawil, S.; Aguirre, B.E. Search and rescue in collapsed structures: Engineering and social science aspects. Disasters 2010, 34, 1084–1101. [Google Scholar] [CrossRef]
  3. Gupta, R.; Goodman, B.; Patel, N.; Hosfelt, R.; Sajeev, S.; Heim, E.; Doshi, J.; Lucas, K.; Choset, H.; Gaston, M. Creating xBD: A Dataset for Assessing Building Damage from Satellite Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  4. Nguyen, D.; Ofli, F.; Imran, M.; Mitra, P. Damage assessment from social media imagery data during disasters. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Sydney, Australia, 31 July–3 August 2017; Diesner, J., Ferrari, E., Xu, G., Eds.; Association for Computing Machinery, Inc.: New York, NY, USA, 2017; pp. 569–576. [Google Scholar] [CrossRef]
  5. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the ECCV; Springer: Cham, Switzerland, 2018. [Google Scholar]
  6. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Sydney, Australia, 2021; Volume 34, pp. 12077–12090. [Google Scholar]
  7. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  8. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. In Proceedings of the ICLR, Online, 25–29 April 2022. [Google Scholar]
  9. Zhang, J.; Zhou, Z.; Mai, G.; Hu, M.; Guan, Z.; Li, S.; Mu, L. Text2seg: Remote sensing image semantic segmentation via text-guided visual foundation models. arXiv 2023, arXiv:2304.10597. [Google Scholar]
  10. Zhang, C.; Zhang, C.; Li, C.; Qiao, Y.; Zheng, S.; Dam, S.K.; Zhang, M.; Kim, J.U.; Kim, S.T.; Choi, J.; et al. One small step for generative ai, one giant leap for agi: A complete survey on chatgpt in aigc era. arXiv 2023, arXiv:2304.06488. [Google Scholar]
  11. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Gläser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1341–1360. [Google Scholar] [CrossRef]
  12. Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9630–9640. [Google Scholar] [CrossRef]
  13. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  14. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
  15. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research (PMLR): Brookline, MA, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  16. Zhang, C.; Puspitasari, F.D.; Zheng, S.; Li, C.; Qiao, Y.; Kang, T.; Shan, X.; Zhang, C.; Qin, C.; Rameau, F.; et al. A survey on segment anything model (sam): Vision foundation model meets prompt engineering. arXiv 2023, arXiv:2306.06211. [Google Scholar]
  17. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  18. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C. Microsoft COCO: Common Objects in Context. In Proceedings of the ECCV; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  19. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing Through ADE20K Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  20. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  21. Taghanaki, S.A.; Abhishek, K.; Cohen, J.P.; Cohen-Adad, J.; Hamarneh, G. Deep semantic segmentation of natural and medical images: A review. Artif. Intell. Rev. 2019, 54, 137–178. [Google Scholar] [CrossRef]
  22. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
  23. Englebert, A.; Stassin, S.; Nanfack, G.; Mahmoudi, S.A.; Siebert, X.; Cornu, O.; De Vleeschouwer, C. Explaining Through Transformer Input Sampling. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 806–815. [Google Scholar]
  24. Albrecht, C.M.; Blair, J.; Nevill-Manning, C.; Smith, D.; Soroker, D.; Valero, M.; Wilkin, P. Next-generation geospatial-temporal information technologies for disaster management. IBM J. Res. Dev. 2020, 64, 5:1–5:13. [Google Scholar] [CrossRef]
  25. Wheeler, B.J.; Karimi, H.A. Deep learning-enabled semantic inference of individual building damage magnitude from satellite images. Algorithms 2020, 13, 195. [Google Scholar] [CrossRef]
  26. Maxar. Open Data Program. 2025. Available online: https://www.maxar.com/open-data (accessed on 6 July 2025).
  27. Geotechnical Extreme Events Reconnaissance (GEER) Association. Geotechnical Extreme Events Reconnaissance (GEER) Association. 2025. Available online: https://www.geerassociation.org/ (accessed on 6 July 2025).
  28. National Oceanic and Atmospheric Administration (NOAA)—National Geodetic Survey. Emergency Response Imagery Online Viewer. 2025. Available online: https://storms.ngs.noaa.gov/ (accessed on 6 July 2025).
  29. Da, Y.; Ji, Z.; Zhou, Y. Building damage assessment based on Siamese hierarchical transformer framework. Mathematics 2022, 10, 1898. [Google Scholar] [CrossRef]
  30. Freddi, F.; Galasso, C.; Cremen, G.; Dall’Asta, A.; Di Sarno, L.; Giaralis, A.; Gutiérrez-Urzúa, F.; Málaga-Chuquitaype, C.; Mitoulis, S.; Petrone, C.; et al. Innovations in earthquake risk reduction for resilience: Recent advances and challenges. Int. J. Disaster Risk Reduct. 2021, 60, 102267. [Google Scholar] [CrossRef]
  31. GEER Association; EERI Learning From Earthquakes Program. 2023 Türkiye Earthquake Sequence: Preliminary Virtual Reconnaissance Report; Virtual Reconnaissance Report GEER-082; Geotechnical Extreme Events Reconnaissance (GEER) Association & Earthquake Engineering Research Institute (EERI), Kahramanmaraş: Atlanta, GA, USA, 2023; Report date: 6 May 2023. [Google Scholar] [CrossRef]
  32. GEER Association; EERI Learning From Earthquakes Program. February 6, 2023 Türkiye Earthquakes: Reconnaissance Report on Geotechnical and Structural Impacts; Full Reconnaissance Report GEER-082; Geotechnical Extreme Events Reconnaissance (GEER) Association & Earthquake Engineering Research Institute (EERI): Atlanta, GA, USA, 2023. [Google Scholar] [CrossRef]
  33. Robinson, C.; Gupta, R.; Fobi Nsutezo, S.; Pound, E.; Ortiz, A.; Rosa, M.; White, K.; Dodhia, R.; Zolli, A.; Birge, C.; et al. Turkey Earthquake Report; Technical Report MSR-TR-2023-7; Microsoft: Redmond, WA, USA, 2023. [Google Scholar]
  34. Alam, F.; Imran, M.; Ofli, F. Image4Act: Online Social Media Image Processing for Disaster Response. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Sydney, NSW, Australia, 31 July–3 August 2017; pp. 601–604. [Google Scholar]
  35. Nguyen, T.D.; Alam, F.; Ofli, F.; Imran, M. Automatic Image Filtering on Social Networks Using Deep Learning and Perceptual Hashing During Crises. arXiv 2017, arXiv:1704.02602. [Google Scholar]
  36. Nia, K.R.; Mori, G. Building Damage Assessment Using Deep Learning and Ground-Level Image Data. In Proceedings of the 2017 14th Conference on Computer and Robot Vision (CRV), Edmonton, AB, Canada, 16–19 May 2017; pp. 95–102. [Google Scholar]
  37. Alam, F.; Alam, T.; Hasan, M.A.; Hasnat, A.; Imran, M.; Ofli, F. MEDIC: A Multi-Task Learning Dataset for Disaster Image Classification. arXiv 2021, arXiv:2108.12828. [Google Scholar] [CrossRef]
  38. Huang, H.; Zhang, D.; Masalava, A.; Roozbahani, M.M.; Roy, N.; Frost, J.D. Enhancing the Fidelity of Social Media Image Data Sets in Earthquake Damage Assessment. Earthq. Spectra 2025, 41, 1–35. [Google Scholar] [CrossRef]
  39. Li, X.; Caragea, D.; Zhang, H.; Imran, M. Localizing and quantifying damage in social media images. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28–31 August 2018; pp. 194–201. [Google Scholar]
  40. Li, X.; Caragea, D.; Zhang, H.; Imran, M. Localizing and quantifying infrastructure damage using class activation mapping approaches. Soc. Netw. Anal. Min. 2019, 9, 44. [Google Scholar] [CrossRef]
  41. Shekarizadeh, S.; Rastgoo, R.; Al-Kuwari, S.; Sabokrou, M. Deep-Disaster: Unsupervised Disaster Detection and Localization Using Visual Data. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2814–2821. [Google Scholar] [CrossRef]
  42. Alam, F.; Alam, T.; Ofli, F.; Imran, M. Social Media Images Classification Models for Real-time Disaster Response. arXiv 2021, arXiv:2104.04184. [Google Scholar]
  43. Zhang, D.; Huang, H.; Smith, N.S.; Roy, N.; Frost, J.D. From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images. arXiv 2025, arXiv:2507.02781. [Google Scholar]
  44. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Sydney, Australia, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  45. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  46. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  47. Zang, Y.; Li, W.; Zhou, K.; Huang, C.; Loy, C.C. Open-Vocabulary DETR with Conditional Matching. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  48. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
  49. Wu, Q.; Osco, L.P. samgeo: A Python package for segmenting geospatial data with the Segment Anything Model (SAM). J. Open Source Softw. 2023, 8, 5663. [Google Scholar] [CrossRef]
  50. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701117. [Google Scholar] [CrossRef]
  51. Yan, Z.; Li, J.; Li, X.; Zhou, R.; Zhang, W.; Feng, Y.; Diao, W.; Fu, K.; Sun, X. RingMo-SAM: A Foundation Model for Segment Anything in Multimodal Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5625716. [Google Scholar] [CrossRef]
  52. Pu, X.; Jia, H.; Zheng, L.; Wang, F.; Xu, F. ClassWise-SAM-Adapter: Parameter-Efficient Fine-Tuning Adapts Segment Anything to SAR Domain for Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 1234–1245. [Google Scholar] [CrossRef]
  53. Zheng, L.; Pu, X.; Zhang, S.; Xu, F. Tuning a SAM-Based Model With Multicognitive Visual Adapter to Remote Sensing Instance Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 2737–2748. [Google Scholar] [CrossRef]
  54. Wei, Y.; Feng, J.; Liang, X.; Cheng, M.M.; Zhao, Y.; Yan, S. Object Region Mining with Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1568–1576. [Google Scholar]
  55. Jiang, P.T.; Zhang, C.B.; Hou, Q.; Cheng, M.M.; Wei, Y. LayerCAM: Exploring Hierarchical Class Activation Maps. IEEE Trans. Image Process. 2021, 30, 5875–5888. [Google Scholar] [CrossRef]
  56. Feng, J.; Wang, X.; Liu, W. Deep Graph Cut Network for Weakly-Supervised Semantic Segmentation. Sci. China Inf. Sci. 2021, 64, 130105. [Google Scholar] [CrossRef]
  57. Zhou, B.; Khosla, A.; Lapedriza, À.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. arXiv 2015, arXiv:1512.04150. [Google Scholar]
  58. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  59. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847. [Google Scholar]
  60. Abnar, S.; Zuidema, W. Quantifying Attention Flow in Transformers. arXiv 2020, arXiv:2005.00928. [Google Scholar]
  61. Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv 2019, arXiv:1905.09418. [Google Scholar]
  62. Chefer, H.; Gur, S.; Wolf, L. Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 397–406. [Google Scholar]
  63. Chefer, H.; Gur, S.; Wolf, L. Transformer Interpretability Beyond Attention Visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 782–791. [Google Scholar]
  64. Xie, W.; Li, X.H.; Cao, C.C.; Zhang, N.L. ViT-CX: Causal Explanation of Vision Transformers. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, 19–25 August 2023; Elkind, E., Ed.; pp. 1569–1577. [Google Scholar] [CrossRef]
  65. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2011/guidelines.html (accessed on 1 January 2025).
  66. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  67. Hartigan, J.A.; Wong, M.A. A k-means clustering algorithm. JSTOR Appl. Stat. 1979, 28, 100–108. [Google Scholar] [CrossRef]
  68. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  69. Zhang, Z.; Sabuncu, M.R. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
  70. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
  71. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  72. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  73. Hu, M.; Li, Y.; Yang, X. BreastSAM: A Study of Segment Anything Model for Breast Tumor Detection in Ultrasound Images. arXiv 2023, arXiv:2305.12447. [Google Scholar]
  74. Raveendran, S.; Patil, M.D.; Birajdar, G.K. Underwater image enhancement: A comprehensive review, recent trends, challenges and applications. Artif. Intell. Rev. 2021, 54, 5413–5467. [Google Scholar] [CrossRef]
  75. OpenAI. GPT-4V(ision) Technical Report. 2023. Available online: https://openai.com/research/gpt-4v-system-card (accessed on 6 July 2025).
  76. Li, J.; Li, D.; Xiong, C.; Hoi, S.C. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 12885–12900. [Google Scholar]
  77. Minderer, M.; Gritsenko, A.A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple Open-Vocabulary Object Detection with Vision Transformers. arXiv 2022, arXiv:2205.06230. [Google Scholar]
  78. Tangkaratt, V.; Han, B.; Khan, M.E.; Sugiyama, M. VILD: Variational Imitation Learning with Diverse-quality Demonstrations. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  79. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded Language-Image Pre-training. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  80. Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.H.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. GLIPv2: Unifying Localization and Vision-Language Understanding. arXiv 2022, arXiv:2206.05836. [Google Scholar]
  81. Kamath, A.; Singh, M.; LeCun, Y.; Misra, I.; Synnaeve, G.; Carion, N. MDETR–Modulated Detection for End-to-End Multi-Modal Understanding. arXiv 2021, arXiv:2104.12763. [Google Scholar]
  82. Xie, W.; Li, X.-H.; Cao, C.C.; Zhang, N.L. ViT-CX: Causal Explanation of Vision Transformers. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, 19–25 August 2023. [Google Scholar]
  83. Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 2015, 10, e0130140. [Google Scholar] [CrossRef]
  84. Goldfeld, Z.; van den Berg, E.; Greenewald, K.; Melnyk, I.; Nguyen, N.; Kingsbury, B.; Polyanskiy, Y. Estimating Information Flow in Deep Neural Networks. arXiv 2018, arXiv:1810.05728. [Google Scholar]
  85. Sun, Z.; Fang, Y.; Wu, T.; Zhang, P.; Zang, Y.; Kong, S.; Xiong, Y.; Lin, D.; Wang, J. Alpha-CLIP: A CLIP Model Focusing on Wherever You Want. arXiv 2023, arXiv:2312.03818. [Google Scholar] [CrossRef]
Figure 1. The proposed DASeg pipeline. The left and right branches perform point prompt generation and box prompt generation, respectively. The black region in the input image I corresponds to the damage area to be segmented. The fire and snowflake icons represent fine-tuned and frozen models, respectively. The green box indicates a bounding box prompt, and the green star marks a point prompt. Predicted segmentation regions are displayed in blue.
Figure 2. The DASeg-Quake framework. The steps align with the pipeline in Figure 1, illustrating the progression from prompt generation to post-earthquake damage area segmentation. Green boxes denote the damaged structures detected by Grounding DINO, and green stars mark the attention points most relevant to damaged structures.
Figure 3. Example of subjective annotation for the class “Debris”. Even with annotation guidelines, the debris class is difficult to label: Annotators A and B both provide meaningful annotations for this image, yet the locations of their bounding boxes differ.
Figure 4. Illustration of various bounding box labeling strategies. The arrow indicates the increasing proportion of irrelevant pixels (R) as the labeling approach transitions from conservative to more generalized strategies. At approximately 100%, the strategy aims to encompass the entire damaged area within a single bounding box, disregarding the background area (A_bg). The sample photo below each strategy depicts the corresponding ground truth bounding boxes, where blue solid lines indicate Debris and orange dashed lines represent Damaged Structures (DS).
Figure 5. The mIoU score of damage localization under different settings. The blue line represents damage localization results with varying numbers of point prompts, the purple line shows results for the bounding box prompt, and the light red line indicates the performance of DASeg-Quake.
Figure 6. Saliency maps generated by different models. GT denotes the ground truth mask of the damaged area. In the saliency visualizations, red areas indicate higher model attention, while blue areas indicate lower attention.
Figure 7. The comparative performance of DASeg-Quake against previous studies. Bold values in the table indicate the highest mIoU and Pixel Accuracy (PA) among previous studies and the corresponding values for DASeg-Quake.
Figure 8. Comparison with fully supervised segmentation approaches on the DSS dataset for binary damage classification. Bold values in the table indicate the highest mIoU and Pixel Accuracy (PA) among previous studies, and the corresponding values for DASeg-Quake when it outperforms them. FLOPs refers to Floating Point Operations. In the histogram, DeepLabV3-MNetV3 and DeepLabV3-Res101 refer to DeepLabV3 models using MobileNetV3 and ResNet101 backbones, respectively.
Figure 9. Damaged areas detected by DASeg-Quake. The Mask2Former and BEiT models use Swin-S and ViT-B backbones, respectively. The red regions indicate the ground truth damage areas labeled by expert annotators, while the light green regions represent the predictions from the various models.
Figure 10. Visualization of the DASeg-Quake output. GT denotes the ground truth mask of the damaged area. The green boxes and stars represent the model’s box prompts and point prompts, respectively.
Figure 11. Semantic segmentation results for misclassified images in DASeg. GT denotes the ground truth mask of the damaged area.
Table 1. Summary of the EID and DSS datasets used in this study. “#Classes” and “#Images” indicate the number of classes and images, respectively, where “#” denotes “number of.”
Dataset | Image Source(s) | Disaster Events (Year) | #Classes | #Images | Task | Resolution Range
EID [38] | X (Twitter), Google | Nepal (2015), Illapel (2015), Ecuador (2016), Mexico (2017), Iran–Iraq (2017) | 4 | 13,513 | Classification | 48 × 48–7191 × 4571 pixels
DSS [43] | Google | Wenchuan (2008), Haiti (2010), Nepal (2015), Türkiye (2023), Morocco (2023) | 3 | 607 | Semantic segmentation | 250 × 213–8258 × 5505 pixels
Table 2. The mIoU score of DASeg-Quake under different labeling strategies. Avg denotes the average score from Annotator A and Annotator B. Bold values indicate the highest score within each numeric column.
R | Annotator A (%) | Annotator B (%) | Avg (%)
0∼5% | 83.3 | 82.6 | 83.0
25∼35% | 85.5 | 86.5 | 86.0
45∼55% | 83.8 | 85.5 | 84.7
75∼85% | 83.3 | 83.4 | 83.4
95∼100% | 84.3 | 84.9 | 84.6
Table 3. Evaluation results for different threshold values used to generate the point prompts V_p. Bold values indicate the highest mIoU and Pixel Accuracy (PA) scores.
Threshold | mIoU (%) | PA (%)
0.5 | 80.8 | 84.3
0.6 | 81.1 | 84.7
0.7 | 81.2 | 85.0
0.8 | 82.6 | 86.3
0.9 | 80.5 | 84.9
Table 4. Classification results for disaster damage prediction using different backbones and numbers of fine-tuned blocks. “–” indicates the same model variant as above. Bold values indicate the highest Accuracy and F1 score.
Model Variant | Num of Blocks | Acc (%) | F1 (%)
DINOv2-S | 1 | 90.7 | 90.1
– | 3 | 91.3 | 90.8
– | 5 | 91.0 | 90.5
– | All | 91.0 | 90.7
DINOv2-B | 1 | 91.6 | 91.3
– | 3 | 92.1 | 92.2
– | 5 | 91.7 | 91.6
– | All | 91.6 | 91.3
DINOv2-L | 1 | 90.3 | 89.8
– | 3 | 91.6 | 91.2
– | 5 | 90.4 | 89.9
– | All | 90.4 | 89.7
Table 5. Comparison of mIoU and Pixel Accuracy (PA) across different annotators. The final bounding box represents the bounding box that achieved consensus.
Bounding Box Source | mIoU (%) | PA (%)
Annotator A | 79.86 | 87.86
Annotator B | 79.82 | 87.89
Final BBox | 81.56 | 88.19
Table 6. Inter-annotator agreement and zero-shot DASeg-Quake performance against expert reference (Annotator B).
Disaster Type | Annotator A vs. B: mIoU (%) | Annotator A vs. B: PA (%) | DASeg-Quake (Zero-Shot) vs. B: mIoU (%) | DASeg-Quake (Zero-Shot) vs. B: PA (%)
Lateral Spreading | 79.87 | 83.31 | 68.41 | 83.81
Rupture | 82.03 | 85.29 | 69.24 | 81.47
Landslide | 81.59 | 84.38 | 74.98 | 92.15
Sinkhole | 92.22 | 95.93 | 87.49 | 95.98
Average | 83.93 | 87.23 | 75.03 | 88.35