1. Introduction
Assessing the condition of buildings after natural or man-made disasters is a central challenge, and it must be performed with great care and accuracy to support coordinated emergency response. Efficient rescue and relief efforts depend on promptly determining the location and extent of infrastructure damage so that scarce resources can be allocated where they are most needed. Traditional, largely manual assessment workflows often take days or weeks at the city scale, whereas operational decisions in the field must typically be made within hours of an event.
Current damage assessment methods have a number of drawbacks. Reviewing satellite or aerial imagery by hand is slow and difficult to scale. Although automated systems can generate valuable predictions, practitioners frequently struggle to convert technical results into practical recommendations. In particular, the gap between low-level model outputs (such as pixel-wise masks or scores) and the high-level, narrative information required by decision-makers is often left unbridged. Emergency management teams therefore require integrated technologies that accurately detect and classify damage and convey findings in language that is both clear and operationally meaningful.
Using a unified multimodal AI pipeline that combines computer vision, vision–language reasoning, and natural language synthesis, our proposed DisasterReliefGPT is intended to overcome these shortcomings. The framework comprises three primary modules: DisasterOCS, which facilitates precise building localization and damage segmentation using multi-temporal satellite imagery; an LVLM, which provides comprehensive contextualization by analyzing both the original images and their associated segmentation masks; and an LLM, which converts these intermediate outputs into structured reports, summaries, and suggestions specific to emergency response requirements.
The problem of building damage assessment is formulated as two connected tasks: the first identifies structures in pre-disaster imagery, and the second classifies the degree of damage to those structures in post-disaster imagery. We treat this as a one-to-many semantic change detection problem, in which a single semantic category in the pre-event image (building versus background) is mapped to a spectrum of damage classifications in the post-event image, ranging from intact to destroyed. This formulation enables the model to explicitly represent how the status of each building changed between the two time points.
Instead of suggesting a completely new segmentation architecture, this work mainly concentrates on system integration and operationalization. It aims to show how developments in computer vision, vision–language modeling, and language generation may be integrated into a workable end-to-end pipeline for disaster response in the real world.
The following is a summary of this work’s primary contributions:
- 1.
A unified DisasterReliefGPT architecture integrates computer vision, vision–language modeling, and language generation for comprehensive disaster impact assessment and reporting.
- 2.
The DisasterOCS segmentation framework uses partially shared ResNet34 encoders with task-specific contextual processing and a MultiCrossEntropyDiceLoss objective to improve multi-class damage segmentation on temporal image pairs.
- 3.
An LVLM is incorporated that processes segmentation masks and raw images together to provide more context-aware visual interpretation and explanation.
- 4.
A post-processing step at the object level reduces noisy, pixel-wise label fluctuations by enforcing semantic consistency at the level of individual buildings.
- 5.
Extensive evaluation on a large-scale benchmark dataset demonstrates strong segmentation performance and the ability to synthesize natural-language outputs that are suitable for operational use in diverse disaster scenarios.
2. Related Work
Post-disaster damage assessment using remote sensing has received significant attention due to its importance in enabling rapid response and recovery operations. Early research in this area primarily focused on change detection techniques using satellite imagery. Singh [
1] provided one of the earliest comprehensive reviews of digital change detection methods using remotely sensed data, highlighting techniques such as image differencing, ratioing, and post-classification comparison. These early methods relied on pixel-level statistical analysis and manual interpretation.
Building damage assessment has since progressed from these early pixel-based change detection techniques [1] to deep learning methodologies, which revolutionized image analysis and semantic segmentation. Long et al. [2] pioneered fully convolutional networks (FCNs), enabling end-to-end training for dense prediction tasks and establishing a fundamental architecture for aerial imagery analysis. Ronneberger et al. [3] proposed the U-Net architecture, which uses an encoder–decoder structure with skip connections to retain spatial information; U-Net has since become widely adopted for segmentation tasks in both medical imaging and remote sensing applications.
Further architectural advancements improved the ability of neural networks to learn hierarchical visual features. He et al. [4] introduced Residual Networks (ResNets), whose residual connections allow very deep networks to be trained effectively and to capture complex patterns in satellite imagery. Lin et al. [5] proposed Feature Pyramid Networks (FPNs), enabling multi-scale feature extraction, which is particularly useful for analyzing satellite imagery where objects vary significantly in scale.
Handling class imbalance in segmentation tasks has also been widely studied. Milletari et al. [6] introduced the Dice loss, which directly optimizes the overlap between predicted and ground-truth regions and has proven effective at mitigating class imbalance. The Tversky index [7] extends this concept with tunable parameters that balance false positives and false negatives, allowing the precision–recall tradeoff to be adjusted. These loss functions are particularly useful for damage detection tasks, where damaged areas occupy a relatively small portion of the image.
With improvements in high-resolution satellite imagery, object-based image analysis (OBIA) has become an important research direction. Instead of analyzing individual pixels, OBIA methods group pixels into meaningful objects to capture spatial and contextual information. Blaschke [8] established the principles of OBIA, emphasizing spatial context and object-level coherence, and demonstrated that object-based approaches can significantly improve classification accuracy in remote sensing tasks involving complex structures such as buildings. Complementing this, Wu et al. [9] developed efficient connected-component labeling algorithms that support real-time processing of satellite imagery collections.
Multimodal AI techniques have gained traction in disaster assessment applications in recent years. Chen and colleagues [10] studied the integration of Large Vision–Language Models (LVLMs) for expedited post-disaster damage assessment and reporting, demonstrating the potential of multimodal approaches to generate human-readable outputs. Al Shafian and Hu [11] investigated the combination of remote sensing and machine learning in disaster management, highlighting the importance of systems that can effectively communicate outcomes to non-technical stakeholders.
Recent dataset developments have boosted progress in disaster assessment research. Kopiika and associates [
12] formulated techniques for expedited post-disaster infrastructure damage characterization, while Wang et al. [
13] presented DisasterM3, a remote sensing vision–language dataset enabling multimodal disaster assessment investigation. Calantropio and colleagues [
14] contributed through their research on deep learning for automated building damage assessment utilizing ISPRS datasets.
Large-scale annotated datasets have also played an important role in advancing disaster damage detection research. Gupta et al. [
15] introduced the xBD dataset, one of the largest publicly available datasets for building damage assessment across multiple disaster types. The xView2 study [
16] examined building damage using satellite imagery to support rapid post-disaster damage assessment.
Recent research has explored the integration of multimodal artificial intelligence systems for disaster management. Radford et al. [
17] introduced CLIP, a vision–language model capable of learning joint representations of images and text. Such models have demonstrated strong performance in cross-modal understanding tasks and have begun to be applied to remote sensing applications.
Despite these advances, many existing approaches focus primarily on individual components of the disaster assessment pipeline, such as segmentation accuracy or multimodal reasoning. Few studies integrate damage detection, contextual interpretation, and automated reporting within a unified framework. The proposed DisasterReliefGPT framework addresses this gap by combining semantic segmentation, object-level damage analysis, vision–language interpretation, and automated report generation into a single end-to-end system designed for operational disaster response.
In contrast to prior work that focuses primarily on either segmentation accuracy or multimodal interpretation, the proposed system emphasizes end-to-end integration from damage detection to automated situation reporting. The approach combines object-level semantic consistency, modular separation of perception and reporting components, and automated generation of operational summaries. This positioning highlights the practical value of the proposed pipeline for real-world disaster response scenarios.
3. Methodology
3.1. Dataset Description
Our experimentation is based on the xBD dataset, which includes 22,068 satellite images covering 6 disaster categories across 15 nations and 4 continents. The imagery spans 45,361 square kilometers and contains 850,736 annotated structures labeled with 4 damage intensity levels.
Figure 1 visually confirms the wide coverage and diversity of the xBD dataset, while
Figure 2 offers a comprehensive representation of data heterogeneity and annotation procedures for effective performance benchmarking.
3.2. System Architecture
DisasterReliefGPT uses a multi-phase processing pipeline to transform satellite imagery into intelligence that is useful for operations. As shown in
Figure 3, the configuration incorporates four unified modules:
- 1.
Image Acquisition and Preprocessing Module: Satellite and unmanned aerial vehicle imagery are subjected to context-aware sharpening, histogram normalization, and bilateral noise reduction.
- 2.
DisasterOCS Computer Vision Engine: Structural location and damage intensity masks are created by processing temporal image pairs.
- 3.
LVLM Processing Engine: Using segmentation masks and pre- and post-event imagery analysis, visual interpretation and situational context are generated.
- 4.
LLM Reporting Module: Comprehensive reports, structured summaries, and prioritized suggestions are produced using DisasterOCS and LVLM outputs.
Although there is no direct feedback loop between these modules, robustness to error propagation is achieved through structural design. The shared encoder in the DisasterOCS model guarantees stable feature extraction across temporal domains, avoiding unstable change detection. In addition, the segmentation output undergoes object-level semantic consolidation before being fed into the LVLM and LLM modules, which reduces the effect of potential pixel-level misclassifications from earlier stages.
It is important to clarify that the term “GPT” in the system name refers specifically to the automated report generation capability of the final stage rather than the core perception model. The DisasterOCS module is a CNN-based semantic change detection framework and is responsible for all quantitative evaluation results presented in this work. The LVLM and LLM modules operate strictly as post-inference components that transform building-level outputs into human-readable situational reports.
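As an illustration of the histogram-normalization step in the preprocessing module, the following is a minimal numpy sketch of per-channel histogram equalization applied to a temporal image pair. The other operators the paper mentions (context-aware sharpening, bilateral filtering) are not reproduced here, and all function names are illustrative rather than taken from the DisasterReliefGPT implementation:

```python
import numpy as np

def histogram_equalize(channel: np.ndarray) -> np.ndarray:
    """Equalize a single uint8 channel via its cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-empty intensity bin
    # Map intensities so the cumulative distribution becomes ~linear.
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255)
    return lut.clip(0, 255).astype(np.uint8)[channel]

def normalize_pair(pre: np.ndarray, post: np.ndarray):
    """Equalize both temporal images channel-by-channel to reduce
    illumination discrepancies between acquisition dates."""
    eq = lambda img: np.stack(
        [histogram_equalize(img[..., c]) for c in range(img.shape[-1])],
        axis=-1)
    return eq(pre), eq(post)
```

Equalizing both images of a temporal pair with the same procedure is one simple way to keep change detection from reacting to lighting differences rather than structural change.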
3.3. Problem Formulation
Structural damage assessment is formulated as two interconnected parts of a single overall goal:
Localization of Structure: To provide a spatial reference, the first component locates buildings within pre-disaster imagery. The process takes the pre-disaster image I_pre as input and produces the binary mask M_loc, where 1 and 0 denote the presence or absence of a building, respectively.
Damage Classification: Using the post-disaster image I_post, the second component creates a multi-class mask M_dmg by estimating severity levels for individual structures. The classification categories are 0 (background), 1 (undamaged), 2 (minor damage), 3 (major damage), and 4 (destroyed). Internal consistency is preserved within structural pixels.
Here, the definitions of minor and major damage are strictly based on the official annotation standards of the ground truth provided by the xBD dataset. Minor damage entails damage that is visually obvious but limited in terms of impairment, with the structure still largely intact. Major damage, on the other hand, entails severe impairment of the structure, with the building still standing but considered unsafe.
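To make the formulation concrete, the following minimal sketch (with hypothetical mask and function names not taken from the paper) checks that a prediction respects this five-class label space and the correspondence between building pixels and non-background damage labels:

```python
import numpy as np

DAMAGE_CLASSES = {0: "background", 1: "undamaged", 2: "minor damage",
                  3: "major damage", 4: "destroyed"}

def is_valid_prediction(m_loc: np.ndarray, m_dmg: np.ndarray) -> bool:
    """m_loc: binary building mask; m_dmg: multi-class damage mask."""
    if m_loc.shape != m_dmg.shape:
        return False
    if not set(np.unique(m_dmg)).issubset(DAMAGE_CLASSES):
        return False
    # Damage labels (1-4) may only appear on building pixels,
    # and background (0) only outside buildings.
    return bool(np.all((m_dmg > 0) == (m_loc == 1)))
```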
3.4. DisasterOCS Segmentation Framework
The processing sequence incorporates:
- 1.
Pre-disaster and post-disaster image inputs;
- 2.
Shared feature extraction utilizing the ResNet34 foundation architecture;
- 3.
Pyramid-style contextual encoding;
- 4.
Multi-objective decoding with feature integration;
- 5.
Object-oriented post-processing for semantic uniformity.
3.4.1. Partially Shared Encoder Architecture
The encoding part uses four hierarchical residual blocks to implement shared parameters using ResNet34. For image pairs (I_pre, I_post), the shared encoder E produces feature hierarchies F_pre = E(I_pre) and F_post = E(I_post), where E preserves parameter sharing while handling both temporal domains.
3.4.2. Task-Specific Contextual Encoding
Task-specific contextual encoding applies Feature Pyramid Network-style lateral connections, which merge upsampled coarse features with higher-resolution encoder outputs at each scale. The UpsampleBlock operation increases spatial resolution utilizing nearest-neighbor interpolation.
3.4.3. Multi-Objective Decoding
Feature Integration
Decoding implements bilinear upscaling and convolutional operations across all scales, merging them into stride-4 feature representations. Multi-scale feature integration is shown in Algorithm 1.
| Algorithm 1 Multi-scale feature integration in DisasterOCS |
1: Input: feature representations {F_i} with scaling factors {s_i}, i = 1, …, 4
2: for i = 1 to 4 do
3:   F_i ← Conv(F_i)
4:   F_i ← Upsample(F_i, s_i)
5: end for
6: F_out ← Concat(F_1, F_2, F_3, F_4)
7: Return: F_out containing 256 channels, scale factor 4
|
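The multi-scale integration step can be sketched in numpy as follows. The channel counts, the random projection standing in for a learned convolution, and all names are illustrative assumptions rather than the paper's implementation; the point is only the shape discipline of bringing every pyramid level to a common stride before merging:

```python
import numpy as np

def upsample_nearest(feat: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def integrate_pyramid(features, factors, out_channels=256):
    """Bring every pyramid level to the stride-4 resolution, then merge.

    features: list of (C_i, H_i, W_i) maps; factors: per-level
    upsampling factors so all levels reach the same spatial size.
    """
    aligned = [upsample_nearest(f, s) for f, s in zip(features, factors)]
    fused = np.concatenate(aligned, axis=0)  # channel-wise merge
    # A projection to out_channels is emulated by a fixed random linear
    # map; in the real network this would be a learned convolution.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((out_channels, fused.shape[0]))
    return np.einsum('oc,chw->ohw', proj, fused)
```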
Structural Localization Component: a dedicated head predicts the binary building mask M_loc from the fused stride-4 features.
Damage Classification Component: the damage head predicts the per-pixel severity mask M_dmg and originates from object-level features within the localization pathway.
3.4.4. MultiCrossEntropyDiceLoss Formulation
Cross-entropy and Dice loss components are integrated into a composite loss function used by DisasterOCS:

L_task = L_CE + λ · L_Dice

The aggregate multi-objective loss becomes

L_total = α · L_loc + β · L_dmg,

where λ, α, and β represent tunable parameters for loss weighting and task equilibrium.
The reasoning behind the above composite loss function is to achieve a balance of pixel-wise classification stability and spatial region consistency. Cross-entropy loss is used to ensure the optimization of the entire set of damage classes, making sure that the minority classes are not neglected during training. Dice loss, on the other hand, focuses on the spatial overlap of the predicted and actual regions, which is especially important in disaster scenarios where the affected region of the image is relatively small. The weighting parameters are introduced to avoid the dominance of localization or damage classification in the optimization process and hence achieve a balance in the detection of the structure and learning of fine-grained damage severity.
It is important to clarify that the MultiCrossEntropyDiceLoss formulation is not introduced as a novel loss function. Instead, it is adopted as a practical design choice tailored for the multi-task setting of building localization and damage classification. The key contribution lies in the task-level weighting strategy, which balances localization and damage objectives and helps prevent smaller damage regions from being under-represented during training.
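A minimal numpy sketch of such a composite cross-entropy + Dice objective is shown below. The weighting names (lam, alpha, beta) are illustrative placeholders, since the paper's exact formulation and weight values are not reproduced here:

```python
import numpy as np

def cross_entropy(probs, target, eps=1e-7):
    """Mean pixel-wise CE. probs: (K, H, W) softmax outputs; target: (H, W) ints."""
    onehot = np.eye(probs.shape[0])[target].transpose(2, 0, 1)
    return float(-(onehot * np.log(probs + eps)).sum(0).mean())

def dice_loss(probs, target, eps=1e-7):
    """1 - mean soft Dice over classes; rewards region overlap."""
    onehot = np.eye(probs.shape[0])[target].transpose(2, 0, 1)
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    return float(1.0 - (2 * inter / (denom + eps)).mean())

def multi_ce_dice(loc_probs, loc_t, dmg_probs, dmg_t,
                  lam=1.0, alpha=1.0, beta=1.0):
    """Composite objective: per-task CE + lam*Dice, tasks weighted alpha/beta."""
    l_loc = cross_entropy(loc_probs, loc_t) + lam * dice_loss(loc_probs, loc_t)
    l_dmg = cross_entropy(dmg_probs, dmg_t) + lam * dice_loss(dmg_probs, dmg_t)
    return alpha * l_loc + beta * l_dmg
```

The Dice term is the one that keeps small damage regions from being swamped: its per-class normalization gives a rare class the same weight in the mean as a frequent one.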
3.4.5. Object-Level Post-Processing
Connected component identification is carried out to ensure semantic consistency within individual structures, and final damage categories are determined by structural-level consensus voting. Object-level post-processing is shown in Algorithm 2.
The semantic consistency within the structural pixels is ensured through the object-level post-processing mechanism. Following the retrieval of the pixel-level structural mask M_loc and damage mask M_dmg, connected component labeling is performed on M_loc to retrieve the building instances. The labels of each building in M_dmg are then aggregated, and a majority voting strategy determines the final damage category. The final mask M_final is then obtained, in which each building is associated with a single damage category, removing noise in the damage labels within the same building.
| Algorithm 2 Object-level post-processing |
1: Input: M_loc (structural mask), M_dmg (damage mask)
2: structural_objects ← ConnectedComponentLabeling(M_loc)
3: for each object o in structural_objects do
4:   pixel_values ← ExtractPixels(o, M_dmg)
5:   category_distribution ← TallyCategoryVotes(pixel_values)
6:   dominant_category ← argmax(category_distribution)
7:   M_final[o] ← dominant_category
8: end for
9: Return: M_final (semantically coherent damage mask)
|
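The object-level consolidation step can be sketched in plain Python/numpy as follows. The 4-connected breadth-first labeling stands in for whatever connected-component routine the authors use, and all names are illustrative:

```python
import numpy as np
from collections import Counter, deque

def connected_components(mask):
    """4-connected component labeling of a binary (H, W) mask."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                count += 1
                labels[i, j] = count
                q = deque([(i, j)])
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            q.append((ny, nx))
    return labels, count

def consolidate(m_loc, m_dmg):
    """Assign each building its majority damage class."""
    labels, n = connected_components(m_loc.astype(bool))
    m_final = np.zeros_like(m_dmg)
    for obj in range(1, n + 1):
        votes = Counter(m_dmg[labels == obj].tolist())
        m_final[labels == obj] = votes.most_common(1)[0][0]
    return m_final
```

After this step, a building that was predicted 75% "minor damage" and 25% "destroyed" at the pixel level is reported uniformly as "minor damage", which is what makes the map usable per structure.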
3.5. Implementation Details
A ResNet-34 backbone with partial weight sharing between the pre-disaster and post-disaster encoding pathways was used to implement the DisasterOCS framework. Four hierarchical residual blocks make up the encoder, which maintains multi-scale feature representations. Multi-objective optimization was made possible by the use of task-specific decoders for damage classification and structural localization.
Resolution-preserving pre-processing and histogram normalization were applied to input imagery in order to minimize illumination discrepancies between temporal pairs. Both multi-class damage segmentation and localization mask supervision were performed using ground-truth annotations from the xBD dataset.
Dice loss and categorical cross-entropy components are combined in the MultiCrossEntropyDiceLoss function. To strike a balance between region-level spatial coherence and pixel-level classification stability, loss weighting parameters were chosen. To avoid localization and damage classification goals taking precedence over one another, task-level weighting coefficients were added.
Pixel-wise prediction was followed by object-level semantic consolidation. Semantic consistency across building pixels was enforced by majority voting, while structural instances were identified by connected component labeling.
3.6. Qwen2-VL 2B Analytical Processing
The Qwen2-VL 2B model creates structured disaster assessment data by utilizing the segmentation results and visual content. The LVLM module processes three input streams, I_pre, I_post, and the consolidated mask M_final, for a comprehensive visual and contextual interpretation. The Qwen2-VL output visualization is shown in
Figure 6.
Input Analysis: the model jointly interprets the original imagery and the segmentation output.
Visual Content Analysis: for example, identifying objects and their damage signatures, together with other environmental context.
Mask Interpretation: the model explains the meaning of the damage categories and their spatial distributions.
It is noteworthy that the LVLM is provided with the post-processed mask M_final instead of the raw pixel-wise prediction M_dmg. Because M_final is generated by object-level majority voting, every structure in the scene is represented by a single dominant damage category. This reduces the chance that pixel-wise misclassifications from the segmentation stage propagate through the multi-stage pipeline.
The following are some of the analytical results:
- 1.
Quantitative indicators: total structure count, damaged-structure count, proportional damage, and destruction percentage, all structured as JSON for easy integration.
- 2.
Detailed category analysis: Totals for each class (destroyed, major damage, minor damage, and intact).
- 3.
Severity assessment: A quick classification based on predetermined thresholds, such as CRITICAL, MODERATE, etc.
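A minimal sketch of how such JSON indicators might be computed from a consolidated damage mask is shown below. The severity thresholds, field names, and helper names are hypothetical, as the paper does not specify them:

```python
import json
import numpy as np

# Hypothetical severity thresholds on the damaged-structure ratio.
SEVERITY = [(0.5, "CRITICAL"), (0.2, "MODERATE"), (0.0, "LOW")]
CLASS_NAMES = {1: "intact", 2: "minor", 3: "major", 4: "destroyed"}

def assessment_metrics(m_final, labels):
    """Summarize a consolidated damage mask plus building labels as JSON.

    labels: connected-component labels (0 = background, 1..N = buildings);
    m_final: object-consistent damage mask with one class per building.
    """
    per_class = {name: 0 for name in CLASS_NAMES.values()}
    n_objects = int(labels.max())
    for obj in range(1, n_objects + 1):
        cls = int(m_final[labels == obj][0])  # single class per building
        per_class[CLASS_NAMES[cls]] += 1
    damaged = per_class["minor"] + per_class["major"] + per_class["destroyed"]
    ratio = damaged / n_objects if n_objects else 0.0
    severity = next(s for t, s in SEVERITY if ratio >= t)
    return json.dumps({
        "total_structures": n_objects,
        "damaged_structures": damaged,
        "damage_ratio": round(ratio, 3),
        "destroyed_pct": round(100 * per_class["destroyed"] / n_objects, 1)
                         if n_objects else 0.0,
        "per_class": per_class,
        "severity": severity,
    })
```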
3.7. LLM Reporting Synthesis
For comprehensive reporting, the LLM module aggregates outputs (Algorithm 3):
Executive overview: A summary of the critical impact.
Structured assessment: Comprehensive data that combines quantitative measures and visual interpretations.
Operational recommendations: Prioritized suggestions organized by area and level of damage.
| Algorithm 3 LLM-based assessment workflow |
Input: DisasterOCS quantitative metrics and LVLM interpretations
1: Extract metrics and interpret insights.
2: Combine quantitative and contextual information.
3: Generate an executive summary and operational guidance.
Output: Comprehensive assessment documentation.
|
3.8. Reproducibility and Code Availability
To ensure transparency and reproducibility, the complete implementation of DisasterReliefGPT, including training scripts, pre-processing pipelines, model configurations, and evaluation tools, is publicly available on GitHub (Version 1.0). Detailed setup instructions and experiment reproduction steps are provided in the repository documentation.
4. Results and Analysis
4.1. DisasterOCS Performance Evaluation
Table 1 summarizes DisasterReliefGPT performance across damage categories, indicating balanced precision, recall, and F1 metrics with notable proficiency in undamaged and destroyed structure identification. Qualitative results are shown in
Figure 7.
High precision rates along the principal diagonal, specifically 90.7% for undamaged structures and 81.3% for destroyed buildings, are visually validated in
Figure 8, highlighting the significant classification accuracy achieved by DisasterOCS.
Standard xBD assessment procedures are followed in the evaluation protocol. Ground-truth damage annotations are directly compared to DisasterOCS’s pixel-by-pixel predictions. In order to prevent class-level statistics from being skewed, background pixels are not included in metric computation.
For every damage category, precision, recall, and F1-score are calculated separately. Metrics reflect building-level semantic correctness after object-level post-processing, ensuring that every structural instance contributes a single, cohesive prediction. Macro-averaged F1 serves as the main performance indicator to counteract the effects of class imbalance, while micro F1 is reported as a supplementary measure of global classification behavior.
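The evaluation protocol above can be sketched as follows. Background exclusion is implemented by masking pixels whose ground truth is class 0, an assumption consistent with the text; the function names are illustrative:

```python
import numpy as np

def f1_report(pred, gt, classes=(1, 2, 3, 4)):
    """Macro and micro F1 over damage classes, excluding background pixels."""
    valid = gt != 0
    p, g = pred[valid], gt[valid]
    f1s, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in classes:
        tp = int(np.sum((p == c) & (g == c)))
        fp = int(np.sum((p == c) & (g != c)))
        fn = int(np.sum((p != c) & (g == c)))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        tp_all += tp; fp_all += fp; fn_all += fn
    macro = sum(f1s) / len(f1s)  # every class weighted equally
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all) if tp_all else 0.0
    return macro, micro
```

Macro averaging gives rare classes such as "destroyed" the same weight as the dominant "undamaged" class, which is why it is the primary indicator under class imbalance.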
4.2. Comparative Discussion with Existing Approaches
While direct re-implementation of existing baseline models was beyond the scope of this study, the reported performance is contextualized using published xBD benchmark studies employing comparable evaluation protocols. The comparison emphasizes not only segmentation accuracy but also architectural efficiency, training stability, and interpretability considerations.
In contrast to purely vision-based systems, DisasterReliefGPT introduces an integrated multimodal reasoning and reporting pipeline. This design prioritizes operational usability by transforming segmentation outputs into structured analytical insights and human-readable assessments.
4.3. MultiCrossEntropyDiceLoss Efficacy
The joint MultiCrossEntropyDiceLoss balances the different classes for damage, sharpens the edges of the predicted regions due to the focus on overlap, and provides more stability to the training process. This is particularly helpful in complicated segmentation scenarios with highly imbalanced frequencies, such as disaster damage maps, where certain levels of damage may be much rarer than others.
4.4. Object-Level Post-Processing Outcomes
This object-based post-processing step helps to keep the damage label of each building consistent, enhances semantic consistency to 90.3%, and reduces noisy, pixel-level label fluctuations by 15.2%. The final damage maps now describe the condition of each structure in a more reliable way and are thus more useful for planning emergency response efforts at the level of individual buildings.
4.5. Component Contribution Discussion
Rather than being a group of separately optimized modules, the DisasterReliefGPT framework is intended to function as an end-to-end operational pipeline. Pixel-level predictions, such as damage classification and structural localization, are entirely the responsibility of the DisasterOCS component. The LVLM and LLM modules, on the other hand, only function at the post-inference stage, emphasizing interpretability, contextual comprehension, and report generation without changing segmentation outputs.
As a result, traditional quantitative segmentation metrics do not accurately represent the contributions of the LVLM and LLM. Rather, by converting unstructured predictions into organized insights, visual explanations, and narratives focused on making decisions, these modules improve the system’s operational value. While the LLM creates thorough evaluation reports appropriate for emergency response workflows, the LVLM enhances the semantic interpretation of segmentation masks and imagery.
Future research will involve systematic component-level evaluations to quantify interpretability gains, decision support effectiveness, and human-centered performance metrics, even though extensive quantitative ablation experiments were outside the purview of this study.
4.6. Assessment Findings Interpretation
DisasterReliefGPT exhibits the capability to accomplish the following:
Achieve beyond 90% recall for structural localization.
Segment damage severity effectively by using composite loss functions.
Integrate visual interpretation via the LVLM and structured reporting through the LLM.
Maintain uniform accuracy in a variety of geographical areas and disaster types.
Although the proposed system was evaluated only on the xBD dataset, xBD itself includes substantial diversity in disaster types, geographic locations, and building attributes. This inherent variability offers a reasonable approximation of domain heterogeneity, even though this study did not explicitly validate on out-of-distribution or cross-dataset settings. Systematic evaluation on fully out-of-distribution datasets therefore remains a crucial direction for future research to further confirm the robustness and generalizability of the model.
5. Conclusions and Future Work
This work introduced DisasterReliefGPT, an integrated multimodal artificial intelligence framework that combines LVLMs for visual interpretation of imagery and segmentation results, LLMs for natural language report generation, and DisasterOCS for computer vision-based damage segmentation using MultiCrossEntropyDiceLoss.
Principal achievements consist of the following:
- 1.
The DisasterReliefGPT architecture effectively integrates computer vision, vision–language, and language modeling.
- 2.
The DisasterOCS segmentation pipeline achieves a damage F1-score of 78.8% with MultiCrossEntropyDiceLoss optimization.
- 3.
Effective multimodal integration is achieved, in which segmentation masks and source images are processed by the LVLM.
- 4.
Semantic coherence within structural segments is guaranteed by object-aware post-processing.
- 5.
Comprehensive validation shows that it is useful for emergency response operations.
Overall, the contribution of this work lies in demonstrating how semantic change detection, object-level reasoning, multimodal interpretation, and automated reporting can be combined into a unified and operational workflow. This positioning highlights the practical and engineering-focused nature of the proposed system.
The system’s effectiveness is confirmed by the classification matrix analysis, which achieves 81.3% accuracy in identifying destroyed buildings and 90.7% accuracy in identifying undamaged structures. The streamlined design is fast enough for real-time disaster response operations while balancing technical correctness with human comprehension.
Despite the promising results, several limitations remain. The current study evaluates the system primarily on a single benchmark dataset and does not include cross-dataset validation. In addition, explicit uncertainty estimation is not yet incorporated into the pipeline, which could further support decision-making in high-risk scenarios. Finally, the multi-stage architecture introduces additional computational overhead, which may impact deployment in resource-constrained environments. Addressing these aspects forms an important direction for future work.