1. Introduction
Assessing the condition of buildings after natural or man-made disasters is a central challenge, and it must be performed with great care and accuracy to support coordinated emergency response. Efficient rescue and relief efforts depend on promptly determining the location and extent of infrastructure damage so that scarce resources can be allocated where they are most needed. Traditional, largely manual assessment workflows often take days or weeks at the city scale, whereas operational decisions in the field must typically be made within hours of an event.
Current damage assessment methods have a number of drawbacks. Reviewing satellite or aerial imagery by hand is slow and difficult to scale. Although automated systems can generate valuable predictions, practitioners frequently struggle to convert technical results into practical recommendations. In particular, the gap between low-level model outputs (such as pixel-wise masks or scores) and the high-level, narrative information required by decision-makers is often left unbridged. Emergency management teams therefore require integrated technologies that accurately detect and classify damage and convey findings in language that is both clear and operationally meaningful.
Using a unified multimodal AI pipeline that combines computer vision, vision–language reasoning, and natural language synthesis, our proposed DisasterReliefGPT is intended to overcome these shortcomings. The framework comprises three primary modules: DisasterOCS, which facilitates precise building localization and damage segmentation using multi-temporal satellite imagery; an LVLM, which provides comprehensive contextualization by analyzing both the original images and their associated segmentation masks; and an LLM, which converts these intermediate outputs into structured reports, summaries, and suggestions specific to emergency response requirements.
The problem of building damage assessment is formulated as two connected tasks: the first identifies structures in pre-disaster imagery, and the second classifies the degree of damage to those structures in post-disaster imagery. We treat this as a one-to-many semantic change detection problem, in which a single semantic category in the pre-event image (building versus background) is mapped to a spectrum of damage classifications in the post-event image, ranging from intact to destroyed. This formulation enables the model to explicitly represent how the status of each building changed between the two time points.
Instead of suggesting a completely new segmentation architecture, this work mainly concentrates on system integration and operationalization. It aims to show how developments in computer vision, vision–language modeling, and language generation may be integrated into a workable end-to-end pipeline for disaster response in the real world.
The following is a summary of this work’s primary contributions:
- 1.
A unified DisasterReliefGPT architecture integrates computer vision, vision–language modeling, and language generation for comprehensive disaster impact assessment and reporting.
- 2.
The DisasterOCS segmentation framework uses partially shared ResNet34 encoders with task-specific contextual processing and a MultiCrossEntropyDiceLoss objective to improve multi-class damage segmentation on temporal image pairs.
- 3.
An LVLM is incorporated that processes segmentation masks and raw images together to provide more context-aware visual interpretation and explanation.
- 4.
A post-processing step at the object level reduces noisy, pixel-wise label fluctuations by enforcing semantic consistency at the level of individual buildings.
- 5.
Extensive evaluation on a large-scale benchmark dataset demonstrates strong segmentation performance and the ability to synthesize natural-language outputs that are suitable for operational use in diverse disaster scenarios.
2. Related Work
Post-disaster damage assessment using remote sensing has received significant attention due to its importance in enabling rapid response and recovery operations. Early research in this area primarily focused on change detection techniques using satellite imagery. Singh [
1] provided one of the earliest comprehensive reviews of digital change detection methods using remotely sensed data, highlighting techniques such as image differencing, ratioing, and post-classification comparison. These early methods relied on pixel-level statistical analysis and manual interpretation.
Building damage assessment has since progressed from these early pixel-based change detection techniques [1] to deep learning methodologies, which revolutionized image analysis and semantic segmentation. Long et al. [2] pioneered fully convolutional networks (FCNs), enabling end-to-end training for dense prediction tasks and establishing a fundamental architecture for aerial imagery analysis. Ronneberger et al. [3] proposed the U-Net architecture, which uses an encoder–decoder structure with skip connections to retain spatial information; U-Net has since become widely adopted for segmentation tasks in both medical imaging and remote sensing applications.
Further architectural advancements improved the ability of neural networks to learn hierarchical visual features. He et al. [4] introduced Residual Networks (ResNets), whose residual connections allow very deep networks to be trained effectively and to capture complex patterns in satellite imagery. Lin et al. [5] proposed Feature Pyramid Networks (FPNs), enabling multi-scale feature extraction, which is particularly useful for analyzing satellite imagery where objects vary significantly in scale.
Handling class imbalance in segmentation tasks has also been widely studied. Milletari et al. [6] introduced the Dice loss, which directly optimizes the overlap between predicted and ground-truth regions and has proven effective at mitigating class imbalance. The Tversky index [7] extends this concept with tunable parameters that balance false positives and false negatives, allowing the precision–recall tradeoff to be adjusted. These loss functions are particularly useful for damage detection tasks, where damaged areas occupy a relatively small portion of the image.
With improvements in high-resolution satellite imagery, object-based image analysis (OBIA) has become an important research direction. Instead of analyzing individual pixels, OBIA methods group pixels into meaningful objects to capture spatial and contextual information. Blaschke [8] established the principles of OBIA, emphasizing spatial context and object-level coherence, and demonstrated that object-based approaches can significantly improve classification accuracy in remote sensing tasks involving complex structures such as buildings. Complementing this, Wu et al. [9] developed efficient connected-component labeling algorithms that support real-time processing of satellite imagery collections.
Multimodal AI techniques have gained traction in disaster assessment applications in recent years. Chen and colleagues [10] studied the integration of Large Vision–Language Models (LVLMs) for expedited post-disaster damage assessment and reporting, demonstrating the potential of multimodal approaches to generate human-readable outputs. Al Shafian and Hu [11] investigated the combination of remote sensing and machine learning in disaster management, highlighting the importance of systems that can effectively communicate outcomes to non-technical stakeholders.
Recent dataset developments have boosted progress in disaster assessment research. Kopiika and associates [
12] formulated techniques for expedited post-disaster infrastructure damage characterization, while Wang et al. [
13] presented DisasterM3, a remote sensing vision–language dataset enabling multimodal disaster assessment investigation. Calantropio and colleagues [
14] contributed through their research on deep learning for automated building damage assessment utilizing ISPRS datasets.
Large-scale annotated datasets have also played an important role in advancing disaster damage detection research. Gupta et al. [
15] introduced the xBD dataset, one of the largest publicly available datasets for building damage assessment across multiple disaster types. The xView2 study [
16] examined building damage using satellite imagery to support rapid post-disaster damage assessment.
Recent research has explored the integration of multimodal artificial intelligence systems for disaster management. Radford et al. [
17] introduced CLIP, a vision–language model capable of learning joint representations of images and text. Such models have demonstrated strong performance in cross-modal understanding tasks and have begun to be applied to remote sensing applications.
Despite these advances, many existing approaches focus primarily on individual components of the disaster assessment pipeline, such as segmentation accuracy or multimodal reasoning. Few studies integrate damage detection, contextual interpretation, and automated reporting within a unified framework. The proposed DisasterReliefGPT framework addresses this gap by combining semantic segmentation, object-level damage analysis, vision–language interpretation, and automated report generation into a single end-to-end system designed for operational disaster response.
In contrast to prior work that focuses primarily on either segmentation accuracy or multimodal interpretation, the proposed system emphasizes end-to-end integration from damage detection to automated situation reporting. The approach combines object-level semantic consistency, modular separation of perception and reporting components, and automated generation of operational summaries. This positioning highlights the practical value of the proposed pipeline for real-world disaster response scenarios.
3. Methodology
3.1. Dataset Description
Our experimentation is based on the xBD dataset, which includes 22,068 satellite images covering 6 disaster categories across 15 nations and 4 continents. The imagery spans 45,361 square kilometers and contains 850,736 annotated structures labeled with 4 damage intensity levels.
Figure 1 visually confirms the wide coverage and diversity of the xBD dataset, while
Figure 2 offers a comprehensive representation of data heterogeneity and annotation procedures for effective performance benchmarking.
3.2. System Architecture
DisasterReliefGPT uses a multi-phase processing pipeline to transform satellite imagery into intelligence that is useful for operations. As shown in
Figure 3, the configuration incorporates four unified modules:
- 1.
Image Acquisition and Preprocessing Module: Satellite and unmanned aerial vehicle imagery are subjected to context-aware sharpening, histogram normalization, and bilateral noise reduction.
- 2.
DisasterOCS Computer Vision Engine: Structural location and damage intensity masks are created by processing temporal image pairs.
- 3.
LVLM Processing Engine: Using segmentation masks and pre- and post-event imagery analysis, visual interpretation and situational context are generated.
- 4.
LLM Reporting Module: Comprehensive reports, structured summaries, and prioritized suggestions are produced using DisasterOCS and LVLM outputs.
Although there is no direct feedback loop between these modules, robustness to error propagation is achieved through structural design. The shared encoder in the DisasterOCS model guarantees stable feature extraction across temporal domains, avoiding unstable change detection. In addition, the segmentation output undergoes object-level semantic consolidation before being fed into the LVLM and LLM modules, which reduces the effect of potential pixel-level misclassifications from earlier stages.
It is important to clarify that the term “GPT” in the system name refers specifically to the automated report generation capability of the final stage rather than the core perception model. The DisasterOCS module is a CNN-based semantic change detection framework and is responsible for all quantitative evaluation results presented in this work. The LVLM and LLM modules operate strictly as post-inference components that transform building-level outputs into human-readable situational reports.
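As an illustration of the histogram-normalization step in the preprocessing module, the following is a minimal numpy sketch of per-channel histogram equalization applied to a temporal image pair. The other operators the paper mentions (context-aware sharpening, bilateral filtering) are not reproduced here, and all function names are illustrative rather than taken from the DisasterReliefGPT implementation:

```python
import numpy as np

def histogram_equalize(channel: np.ndarray) -> np.ndarray:
    """Equalize a single uint8 channel via its cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-empty intensity bin
    # Map intensities so the cumulative distribution becomes ~linear.
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255)
    return lut.clip(0, 255).astype(np.uint8)[channel]

def normalize_pair(pre: np.ndarray, post: np.ndarray):
    """Equalize both temporal images channel-by-channel to reduce
    illumination discrepancies between acquisition dates."""
    eq = lambda img: np.stack(
        [histogram_equalize(img[..., c]) for c in range(img.shape[-1])],
        axis=-1)
    return eq(pre), eq(post)
```

Equalizing both images of a temporal pair with the same procedure is one simple way to keep change detection from reacting to lighting differences rather than structural change.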
3.3. Problem Formulation
Structural damage assessment is formulated as two interconnected parts of a single overall goal:
Localization of Structure: To provide a spatial reference, the first component locates buildings within pre-disaster imagery. The process takes the pre-disaster image I_pre as input and produces the binary mask M_loc, where 1 and 0 denote the presence or absence of a building, respectively.
Damage Classification: Using the post-disaster image I_post, the second component creates a multi-class mask M_dmg by estimating severity levels for individual structures. The classification categories are 0 (background), 1 (undamaged), 2 (minor damage), 3 (major damage), and 4 (destroyed). Internal consistency is preserved within structural pixels.
Here, the definitions of minor and major damage are strictly based on the official annotation standards of the ground truth provided by the xBD dataset. Minor damage entails damage that is visually obvious but limited in terms of impairment, with the structure still largely intact. Major damage, on the other hand, entails severe impairment of the structure, with the building still standing but considered unsafe.
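To make the formulation concrete, the following minimal sketch (with hypothetical mask and function names not taken from the paper) checks that a prediction respects this five-class label space and the correspondence between building pixels and non-background damage labels:

```python
import numpy as np

DAMAGE_CLASSES = {0: "background", 1: "undamaged", 2: "minor damage",
                  3: "major damage", 4: "destroyed"}

def is_valid_prediction(m_loc: np.ndarray, m_dmg: np.ndarray) -> bool:
    """m_loc: binary building mask; m_dmg: multi-class damage mask."""
    if m_loc.shape != m_dmg.shape:
        return False
    if not set(np.unique(m_dmg)).issubset(DAMAGE_CLASSES):
        return False
    # Damage labels (1-4) may only appear on building pixels,
    # and background (0) only outside buildings.
    return bool(np.all((m_dmg > 0) == (m_loc == 1)))
```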
3.4. DisasterOCS Segmentation Framework
The processing sequence incorporates:
- 1.
Pre-disaster and post-disaster image inputs;
- 2.
Shared feature extraction utilizing the ResNet34 foundation architecture;
- 3.
Pyramid-style contextual encoding;
- 4.
Multi-objective decoding with feature integration;
- 5.
Object-oriented post-processing for semantic uniformity.
3.4.1. Partially Shared Encoder Architecture
The encoding part uses four hierarchical residual blocks to implement shared parameters using ResNet34. For image pairs (I_pre, I_post), the shared encoder E produces feature hierarchies F_pre = E(I_pre) and F_post = E(I_post), where E preserves parameter sharing while handling both temporal domains.
3.4.2. Task-Specific Contextual Encoding
Task-specific contextual encoding applies Feature Pyramid Network-style lateral connections, which merge upsampled coarse features with higher-resolution encoder outputs at each scale. The UpsampleBlock operation increases spatial resolution utilizing nearest-neighbor interpolation.
3.4.3. Multi-Objective Decoding
Feature Integration
Decoding implements bilinear upscaling and convolutional operations across all scales, merging them into stride-4 feature representations. Multi-scale feature integration is shown in Algorithm 1.
| Algorithm 1 Multi-scale feature integration in DisasterOCS |
1: Input: feature representations {F_i} with scaling factors {s_i}, i = 1, …, 4
2: for i = 1 to 4 do
3:   F_i ← Conv(F_i)
4:   F_i ← Upsample(F_i, s_i)
5: end for
6: F_out ← Concat(F_1, F_2, F_3, F_4)
7: Return: F_out containing 256 channels, scale factor 4
|
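The multi-scale integration step can be sketched in numpy as follows. The channel counts, the random projection standing in for a learned convolution, and all names are illustrative assumptions rather than the paper's implementation; the point is only the shape discipline of bringing every pyramid level to a common stride before merging:

```python
import numpy as np

def upsample_nearest(feat: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def integrate_pyramid(features, factors, out_channels=256):
    """Bring every pyramid level to the stride-4 resolution, then merge.

    features: list of (C_i, H_i, W_i) maps; factors: per-level
    upsampling factors so all levels reach the same spatial size.
    """
    aligned = [upsample_nearest(f, s) for f, s in zip(features, factors)]
    fused = np.concatenate(aligned, axis=0)  # channel-wise merge
    # A projection to out_channels is emulated by a fixed random linear
    # map; in the real network this would be a learned convolution.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((out_channels, fused.shape[0]))
    return np.einsum('oc,chw->ohw', proj, fused)
```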
Structural Localization Component: a dedicated head predicts the binary building mask M_loc from the fused stride-4 features.
Damage Classification Component: the damage head predicts the per-pixel severity mask M_dmg and originates from object-level features within the localization pathway.
3.4.4. MultiCrossEntropyDiceLoss Formulation
Cross-entropy and Dice loss components are integrated into a composite loss function used by DisasterOCS:

L_task = L_CE + λ · L_Dice

The aggregate multi-objective loss becomes

L_total = α · L_loc + β · L_dmg,

where λ, α, and β represent tunable parameters for loss weighting and task equilibrium.
The reasoning behind the above composite loss function is to achieve a balance of pixel-wise classification stability and spatial region consistency. Cross-entropy loss is used to ensure the optimization of the entire set of damage classes, making sure that the minority classes are not neglected during training. Dice loss, on the other hand, focuses on the spatial overlap of the predicted and actual regions, which is especially important in disaster scenarios where the affected region of the image is relatively small. The weighting parameters are introduced to avoid the dominance of localization or damage classification in the optimization process and hence achieve a balance in the detection of the structure and learning of fine-grained damage severity.
It is important to clarify that the MultiCrossEntropyDiceLoss formulation is not introduced as a novel loss function. Instead, it is adopted as a practical design choice tailored for the multi-task setting of building localization and damage classification. The key contribution lies in the task-level weighting strategy, which balances localization and damage objectives and helps prevent smaller damage regions from being under-represented during training.
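A minimal numpy sketch of such a composite cross-entropy + Dice objective is shown below. The weighting names (lam, alpha, beta) are illustrative placeholders, since the paper's exact formulation and weight values are not reproduced here:

```python
import numpy as np

def cross_entropy(probs, target, eps=1e-7):
    """Mean pixel-wise CE. probs: (K, H, W) softmax outputs; target: (H, W) ints."""
    onehot = np.eye(probs.shape[0])[target].transpose(2, 0, 1)
    return float(-(onehot * np.log(probs + eps)).sum(0).mean())

def dice_loss(probs, target, eps=1e-7):
    """1 - mean soft Dice over classes; rewards region overlap."""
    onehot = np.eye(probs.shape[0])[target].transpose(2, 0, 1)
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    return float(1.0 - (2 * inter / (denom + eps)).mean())

def multi_ce_dice(loc_probs, loc_t, dmg_probs, dmg_t,
                  lam=1.0, alpha=1.0, beta=1.0):
    """Composite objective: per-task CE + lam*Dice, tasks weighted alpha/beta."""
    l_loc = cross_entropy(loc_probs, loc_t) + lam * dice_loss(loc_probs, loc_t)
    l_dmg = cross_entropy(dmg_probs, dmg_t) + lam * dice_loss(dmg_probs, dmg_t)
    return alpha * l_loc + beta * l_dmg
```

The Dice term is the one that keeps small damage regions from being swamped: its per-class normalization gives a rare class the same weight in the mean as a frequent one.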
3.4.5. Object-Level Post-Processing
Connected component identification is carried out to ensure semantic consistency within individual structures, and final damage categories are determined by structural-level consensus voting. Object-level post-processing is shown in Algorithm 2.
The semantic consistency within the structural pixels is ensured through the object-level post-processing mechanism. Following the retrieval of the pixel-level structural mask M_loc and damage mask M_dmg, connected component labeling is performed on M_loc to retrieve the building instances. The labels of each building in M_dmg are then aggregated, and a majority voting strategy determines the final damage category. The final mask M_final is then obtained, in which each building is associated with a single damage category, removing noise in the damage labels within the same building.
| Algorithm 2 Object-level post-processing |
1: Input: M_loc (structural mask), M_dmg (damage mask)
2: structural_objects ← ConnectedComponentLabeling(M_loc)
3: for each object o in structural_objects do
4:   pixel_values ← ExtractPixels(o, M_dmg)
5:   category_distribution ← TallyCategoryVotes(pixel_values)
6:   dominant_category ← argmax(category_distribution)
7:   M_final[o] ← dominant_category
8: end for
9: Return: M_final (semantically coherent damage mask)
|
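The object-level consolidation step can be sketched in plain Python/numpy as follows. The 4-connected breadth-first labeling stands in for whatever connected-component routine the authors use, and all names are illustrative:

```python
import numpy as np
from collections import Counter, deque

def connected_components(mask):
    """4-connected component labeling of a binary (H, W) mask."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                count += 1
                labels[i, j] = count
                q = deque([(i, j)])
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            q.append((ny, nx))
    return labels, count

def consolidate(m_loc, m_dmg):
    """Assign each building its majority damage class."""
    labels, n = connected_components(m_loc.astype(bool))
    m_final = np.zeros_like(m_dmg)
    for obj in range(1, n + 1):
        votes = Counter(m_dmg[labels == obj].tolist())
        m_final[labels == obj] = votes.most_common(1)[0][0]
    return m_final
```

After this step, a building that was predicted 75% "minor damage" and 25% "destroyed" at the pixel level is reported uniformly as "minor damage", which is what makes the map usable per structure.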
3.5. Implementation Details
A ResNet-34 backbone with partial weight sharing between the pre-disaster and post-disaster encoding pathways was used to implement the DisasterOCS framework. Four hierarchical residual blocks make up the encoder, which maintains multi-scale feature representations. Multi-objective optimization was made possible by the use of task-specific decoders for damage classification and structural localization.
Resolution-preserving pre-processing and histogram normalization were applied to input imagery in order to minimize illumination discrepancies between temporal pairs. Both multi-class damage segmentation and localization mask supervision were performed using ground-truth annotations from the xBD dataset.
Dice loss and categorical cross-entropy components are combined in the MultiCrossEntropyDiceLoss function. To strike a balance between region-level spatial coherence and pixel-level classification stability, loss weighting parameters were chosen. To avoid localization and damage classification goals taking precedence over one another, task-level weighting coefficients were added.
Pixel-wise prediction was followed by object-level semantic consolidation. Semantic consistency across building pixels was enforced by majority voting, while structural instances were identified by connected component labeling.
3.6. Qwen2-VL 2B Analytical Processing
The Qwen2-VL 2B model creates structured disaster assessment data by utilizing the segmentation results and visual content. The LVLM module processes three input streams, I_pre, I_post, and the consolidated mask M_final, for a comprehensive visual and contextual interpretation. The Qwen2-VL output visualization is shown in
Figure 6.
Input Analysis: the model jointly interprets the original imagery and the segmentation output.
Visual Content Analysis: for example, identifying objects and their damage signatures, together with other environmental context.
Mask Interpretation: the model explains the meaning of the damage categories and their spatial distributions.
It is noteworthy that the LVLM is provided with the post-processed mask M_final instead of the raw pixel-wise prediction M_dmg. Because M_final is generated by object-level majority voting, every structure in the scene is represented by a single dominant damage category. This reduces the chance that pixel-wise misclassifications from the segmentation stage propagate through the multi-stage pipeline.
The following are some of the analytical results:
- 1.
Quantitative indicators: total structure count, damaged-structure count, proportional damage, and destruction percentage, all structured as JSON for easy integration.
- 2.
Detailed category analysis: Totals for each class (destroyed, major damage, minor damage, and intact).
- 3.
Severity assessment: A quick classification based on predetermined thresholds, such as CRITICAL, MODERATE, etc.
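A minimal sketch of how such JSON indicators might be computed from a consolidated damage mask is shown below. The severity thresholds, field names, and helper names are hypothetical, as the paper does not specify them:

```python
import json
import numpy as np

# Hypothetical severity thresholds on the damaged-structure ratio.
SEVERITY = [(0.5, "CRITICAL"), (0.2, "MODERATE"), (0.0, "LOW")]
CLASS_NAMES = {1: "intact", 2: "minor", 3: "major", 4: "destroyed"}

def assessment_metrics(m_final, labels):
    """Summarize a consolidated damage mask plus building labels as JSON.

    labels: connected-component labels (0 = background, 1..N = buildings);
    m_final: object-consistent damage mask with one class per building.
    """
    per_class = {name: 0 for name in CLASS_NAMES.values()}
    n_objects = int(labels.max())
    for obj in range(1, n_objects + 1):
        cls = int(m_final[labels == obj][0])  # single class per building
        per_class[CLASS_NAMES[cls]] += 1
    damaged = per_class["minor"] + per_class["major"] + per_class["destroyed"]
    ratio = damaged / n_objects if n_objects else 0.0
    severity = next(s for t, s in SEVERITY if ratio >= t)
    return json.dumps({
        "total_structures": n_objects,
        "damaged_structures": damaged,
        "damage_ratio": round(ratio, 3),
        "destroyed_pct": round(100 * per_class["destroyed"] / n_objects, 1)
                         if n_objects else 0.0,
        "per_class": per_class,
        "severity": severity,
    })
```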
3.7. LLM Reporting Synthesis
For comprehensive reporting, the LLM module aggregates outputs (Algorithm 3):
Executive overview: A summary of the critical impact.
Structured assessment: Comprehensive data that combines quantitative measures and visual interpretations.
Operational recommendations: Prioritized suggestions organized by area and level of damage.
| Algorithm 3 LLM-based assessment workflow |
Input: DisasterOCS quantitative metrics and LVLM interpretations
1: Extract metrics and interpret insights.
2: Combine quantitative and contextual information.
3: Generate an executive summary and operational guidance.
Output: Comprehensive assessment documentation.
|
3.8. Reproducibility and Code Availability
To ensure transparency and reproducibility, the complete implementation of DisasterReliefGPT, including training scripts, pre-processing pipelines, model configurations, and evaluation tools, is publicly available on GitHub (Version 1.0). Detailed setup instructions and experiment reproduction steps are provided in the repository documentation.
4. Results and Analysis
4.1. DisasterOCS Performance Evaluation
Table 1 summarizes DisasterReliefGPT performance across damage categories, indicating balanced precision, recall, and F1 metrics with notable proficiency in undamaged and destroyed structure identification. Qualitative results are shown in
Figure 7.
High precision rates along the principal diagonal, specifically 90.7% for undamaged structures and 81.3% for destroyed buildings, are visually validated in
Figure 8, highlighting the significant classification accuracy achieved by DisasterOCS.
Standard xBD assessment procedures are followed in the evaluation protocol. Ground-truth damage annotations are directly compared to DisasterOCS’s pixel-by-pixel predictions. In order to prevent class-level statistics from being skewed, background pixels are not included in metric computation.
For every damage category, precision, recall, and F1-score are calculated separately. Metrics reflect building-level semantic correctness after object-level post-processing, ensuring that every structural instance contributes a single, cohesive prediction. Macro-averaged F1 serves as the main performance indicator to counteract the effects of class imbalance, while micro F1 is reported as a supplementary measure of global classification behavior.
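The evaluation protocol above can be sketched as follows. Background exclusion is implemented by masking pixels whose ground truth is class 0, an assumption consistent with the text; the function names are illustrative:

```python
import numpy as np

def f1_report(pred, gt, classes=(1, 2, 3, 4)):
    """Macro and micro F1 over damage classes, excluding background pixels."""
    valid = gt != 0
    p, g = pred[valid], gt[valid]
    f1s, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in classes:
        tp = int(np.sum((p == c) & (g == c)))
        fp = int(np.sum((p == c) & (g != c)))
        fn = int(np.sum((p != c) & (g == c)))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        tp_all += tp; fp_all += fp; fn_all += fn
    macro = sum(f1s) / len(f1s)  # every class weighted equally
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all) if tp_all else 0.0
    return macro, micro
```

Macro averaging gives rare classes such as "destroyed" the same weight as the dominant "undamaged" class, which is why it is the primary indicator under class imbalance.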
4.2. Comparative Discussion with Existing Approaches
While direct re-implementation of existing baseline models was beyond the scope of this study, the reported performance is contextualized using published xBD benchmark studies employing comparable evaluation protocols. The comparison emphasizes not only segmentation accuracy but also architectural efficiency, training stability, and interpretability considerations.
In contrast to purely vision-based systems, DisasterReliefGPT introduces an integrated multimodal reasoning and reporting pipeline. This design prioritizes operational usability by transforming segmentation outputs into structured analytical insights and human-readable assessments.
4.3. MultiCrossEntropyDiceLoss Efficacy
The joint MultiCrossEntropyDiceLoss balances the different classes for damage, sharpens the edges of the predicted regions due to the focus on overlap, and provides more stability to the training process. This is particularly helpful in complicated segmentation scenarios with highly imbalanced frequencies, such as disaster damage maps, where certain levels of damage may be much rarer than others.
4.4. Object-Level Post-Processing Outcomes
This object-based post-processing step helps to keep the damage label of each building consistent, enhances semantic consistency to 90.3%, and reduces noisy, pixel-level label fluctuations by 15.2%. The final damage maps now describe the condition of each structure in a more reliable way and are thus more useful for planning emergency response efforts at the level of individual buildings.
4.5. Component Contribution Discussion
Rather than being a group of separately optimized modules, the DisasterReliefGPT framework is intended to function as an end-to-end operational pipeline. Pixel-level predictions, such as damage classification and structural localization, are entirely the responsibility of the DisasterOCS component. The LVLM and LLM modules, on the other hand, only function at the post-inference stage, emphasizing interpretability, contextual comprehension, and report generation without changing segmentation outputs.
As a result, traditional quantitative segmentation metrics do not accurately represent the contributions of the LVLM and LLM. Rather, by converting unstructured predictions into organized insights, visual explanations, and narratives focused on making decisions, these modules improve the system’s operational value. While the LLM creates thorough evaluation reports appropriate for emergency response workflows, the LVLM enhances the semantic interpretation of segmentation masks and imagery.
Future research will involve systematic component-level evaluations to quantify interpretability gains, decision support effectiveness, and human-centered performance metrics, even though extensive quantitative ablation experiments were outside the purview of this study.
4.6. Assessment Findings Interpretation
DisasterReliefGPT exhibits the capability to accomplish the following:
Achieve beyond 90% recall for structural localization.
Segment damage severity effectively by using composite loss functions.
Integrate visual interpretation via the LVLM and structured reporting through the LLM.
Maintain uniform accuracy in a variety of geographical areas and disaster types.
Although the proposed system was evaluated only on the xBD dataset, xBD itself includes substantial diversity in disaster types, geographic locations, and building attributes. This inherent variability offers a reasonable approximation of domain heterogeneity, even though this study did not explicitly validate on out-of-distribution or cross-dataset settings. Systematic evaluation on fully out-of-distribution datasets therefore remains a crucial direction for future research to further confirm the robustness and generalizability of the model.
5. Conclusions and Future Work
This work introduced DisasterReliefGPT, an integrated multimodal artificial intelligence framework that combines LVLMs for visual interpretation of imagery and segmentation results, LLMs for natural language report generation, and DisasterOCS for computer vision-based damage segmentation using MultiCrossEntropyDiceLoss.
Principal achievements consist of the following:
- 1.
The DisasterReliefGPT architecture effectively integrates computer vision, vision–language, and language modeling.
- 2.
The DisasterOCS segmentation pipeline achieves a damage F1-score of 78.8% with MultiCrossEntropyDiceLoss optimization.
- 3.
Effective multimodal integration is achieved, in which segmentation masks and source images are processed by the LVLM.
- 4.
Semantic coherence within structural segments is guaranteed by object-aware post-processing.
- 5.
Comprehensive validation shows that it is useful for emergency response operations.
Overall, the contribution of this work lies in demonstrating how semantic change detection, object-level reasoning, multimodal interpretation, and automated reporting can be combined into a unified and operational workflow. This positioning highlights the practical and engineering-focused nature of the proposed system.
The system’s effectiveness is confirmed by the classification matrix analysis, which achieves 81.3% accuracy in identifying destroyed buildings and 90.7% accuracy in identifying undamaged structures. The streamlined design is fast enough for real-time disaster response operations while balancing technical correctness with human comprehension.
Despite the promising results, several limitations remain. The current study evaluates the system primarily on a single benchmark dataset and does not include cross-dataset validation. In addition, explicit uncertainty estimation is not yet incorporated into the pipeline, which could further support decision-making in high-risk scenarios. Finally, the multi-stage architecture introduces additional computational overhead, which may impact deployment in resource-constrained environments. Addressing these aspects forms an important direction for future work.