Flood-LLM: An AI-Driven Framework for Property-Level Flood Risk Assessment Using Multi-Source Urban Data

Jiang, Jing; Wang, Yifei; Manfredini, Manfredo

doi:10.3390/su18062957

by Jing Jiang ¹, Yifei Wang ²,* and Manfredo Manfredini ¹

¹ School of Architecture and Planning, Faculty of Engineering and Design, The University of Auckland, Auckland 1010, New Zealand

² School of Computer Science, Faculty of Science, The University of Auckland, Auckland 1010, New Zealand

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(6), 2957; https://doi.org/10.3390/su18062957

Submission received: 30 November 2025 / Revised: 10 March 2026 / Accepted: 13 March 2026 / Published: 17 March 2026

(This article belongs to the Special Issue Innovative Technologies and Strategies in Disaster Management, 2nd Edition)

Abstract

Flood risk maps play a critical role in land-use regulation, infrastructure planning, and community preparedness, which are key components of sustainable and climate-resilient urban development. Their production, however, remains costly, labor-intensive, and time-demanding as it relies on simulation-driven workflows that combine hydrodynamic modeling with expert interpretation and extensive validation. To address this issue from a sustainability perspective, we develop a novel, practical, and near-real-time large language model (LLM)-based framework to support property-level flood risk assessment. This framework, which synthesizes geospatial, hydrological, infrastructural, and historical flood information, extends existing research and explores novel risk estimation methods for use in planning practice. Using Brisbane, Australia, as a case study, we develop Flood-LLM, a multi-agent system that transforms multi-source urban datasets into structured textual representations, models diverse neighborhood conditions, and fine-tunes a reasoning model using expert-assessed risk classifications. The results show that Flood-LLM can reproduce official flood risk labels for creek, river, storm tide, and overland-flow hazards with reasonable accuracy, outperforming classical machine learning, deep learning, and untuned LLM baselines. Visual and quantitative analyses indicate that the framework demonstrates a qualitatively nuanced capability to capture salient spatial patterns present in the official maps, while generating a textual chain-of-thought providing a transparent audit trail for its labeling decisions. These findings suggest that such LLM-based approaches can produce potential complementary tools to expert-reviewed planning classifications and support more sustainable, adaptive flood risk management by enabling timely map production and updates that facilitate informed decision-making in rapidly changing environmental conditions.

Keywords: flood risk assessment; large language models; geospatial data integration; disaster management

1. Introduction

Floods remain one of the most disruptive and recurrent natural hazards worldwide, with increasing frequency and severity driven by climate change, rapid urbanization, and aging drainage infrastructure. In response, flood management has shifted from a focus on structural protection toward a more integrated flood risk management (FRM) paradigm that emphasizes vulnerability reduction, informed decision-making, and community resilience [1,2]. Within this paradigm, flood risk mapping is an essential tool for development guidance, preparedness planning, and public communication, and is increasingly regarded as a key instrument for supporting sustainable and climate-resilient urban development. Unlike traditional inundation mapping that describes physical characteristics such as extent, depth, and duration [3], flood risk mapping synthesizes modeled flood behavior with contextual information, including historical flood records, surface and subsurface interactions, infrastructure conditions, and expert judgment to produce categorical planning-oriented risk levels [4,5]. Importantly, these outputs are not direct products of raw hydrodynamic simulations. Rather, they represent interpreted regulatory layers in which simulation results are integrated with professional judgment, local contextual knowledge, and policy considerations for urban planning.

Producing flood risk maps, however, is inherently challenging. Urban flooding results from the complex interaction of multiple domains, including topography, rainfall–runoff dynamics, drainage-network performance, and soil permeability [6]. As a result, operational workflows in regions such as Australia and Europe depend on computationally intensive hydrodynamic simulations, expert calibration, and multiple rounds of validation [7,8]. These methods provide effective, empirically grounded results but remain resource-demanding, difficult to scale, and heavily dependent on intra- and inter-domain expertise. Critically, such workflows often struggle to keep pace with changes in land use, surface characteristics, and infrastructure configuration caused by rapid urban densification, expansion, or redevelopment. Empirical models such as Random Forest, CatBoost, CNNs, or GNNs have been increasingly adopted to complement or accelerate such tasks [9,10,11]. Yet, despite this progress, existing data-driven approaches continue to face three major limitations.

First, multi-domain heterogeneity remains difficult to model. Flood risk assessment depends simultaneously on geographical, hydrological, infrastructural, and historical datasets that often use incompatible formats and scales. Conventional vector-based machine learning (ML) and deep learning (DL) models struggle to integrate these heterogeneous inputs coherently [12,13]. Second, spatial neighborhood, integrating micro-, meso-, and macro-factors, is not easy to model. The relevant contextual area for assessing a property’s flood risk may extend far beyond immediate adjacency, while the combined effect of multiple granular interventions varies significantly across locations [14]. Models that either truncate the neighborhood or treat it uniformly risk losing important local cues related to micro-topography, upstream flow paths, or drainage-network connectivity. Third, interpretability remains limited. Many state-of-the-art models operate as black boxes, offering little transparency regarding how input factors contribute to predicted risk levels [15]. This lack of explainability constrains their reliability for decision-makers, practitioners, and the public. A summary of representative studies in this area is provided in Table 1.

These persistent limitations motivate the exploration of complementary analytical methods, such as large language models (LLMs), which explore new directions for flood risk mapping. LLMs possess advantages that are not available in traditional ML/DL architectures: they can integrate heterogeneous information expressed in natural language, reason across domains, and generate interpretable chains of thought. Recent research has shown that LLM-based systems can emulate expert reasoning patterns in hydrology and disaster assessment [27], suggesting their potential to support risk-mapping workflows that require extensive contextualization.

To address the challenges outlined above, we have developed Flood-LLM, a multi-agent LLM-based framework designed to construct representations of diverse geospatial inputs, summarize spatial neighborhoods, and learn from expert-interpreted planning maps. Rather than modeling physical flood dynamics, the Flood-LLM framework explores how an LLM-based approach can use heterogeneous spatial, environmental, and infrastructural indicators to support flood-risk classifications for planning maps. In this sense, our aim is to support the development of complementary data-driven methods through an integrated framework whose potential is assessed by validation against the label assignment patterns of official flood maps. Accordingly, this paper addresses the following research questions: RQ1: Can LLMs integrate heterogeneous geospatial, hydrological, drainage-network, and historical flood data to approximate the property-level flood risk classifications used in planning maps? RQ2: How can LLM-based systems incorporate diverse and uneven spatial neighborhoods when learning these mappings? RQ3: To what extent can LLM-generated outputs provide interpretable insights into how these planning classifications relate to heterogeneous urban indicators?

To answer these questions, Flood-LLM introduces three coordinated agents: a Relevant Info Agent that gathers and narrates multi-domain inputs; a Div-max Neighbor Info Agent that extracts semantically diverse neighborhood contexts; and a Learnable Estimation Agent fine-tuned via LoRA (low-rank adaptation) to approximate expert-informed planning labels (see Figure 1). The framework enables cross-domain reasoning without requiring explicit vectorization and generates interpretable reasoning chains alongside risk predictions.

To clarify the intended scope of Flood-LLM, we submit that Flood-LLM offers neither an alternative to hydrodynamic simulation nor a tool for direct statutory or regulatory decision-making. Instead, Flood-LLM is introduced as an exploratory methodological framework to deploy large language models within complex mapping processes. By integrating heterogeneous spatial datasets, the framework facilitates the generation of expert-interpreted planning labels. It is conceived as an exploratory investigation into whether such an approach could serve as a multi-level support tool for practical, effective, and rapid productive flood risk analyses, which are particularly needed in situations where full physical modeling is constrained by insufficient data availability, processing time, or expert/computational resources. By operationalizing different layers of the flood-risk knowledge system as proxy indicators, Flood-LLM provides planning-oriented risk indications more rapidly than conventional workflows.

This learning paradigm necessarily conditions the model’s behavior on the characteristics of the planning classifications on which it is trained, and therefore may also reflect the uncertainties and limitations inherent in those underlying data sources. However, it also reveals a positive implication for transferability, as it suggests the potential for adaptation to different spatial contexts through retraining on locally defined classifications. Crucially, if approaches of this kind were ever to be considered for practical use in the future, such adaptation would require careful attention to issues of expert oversight, ethical governance, and accountability.

The contribution of this study, therefore, lies in exploring the potential of a new technological approach for assisting early-stage assessment, rapid updating, and data-constrained analysis. Specifically, the main contributions of this paper are as follows: (1) we explore the feasibility of an LLM-based framework for approximating property-level flood risk classifications from heterogeneous multi-source urban data; (2) we design a diversity-based neighborhood information mechanism to examine how spatial dependencies among properties can be incorporated into such a framework while remaining computationally tractable; (3) we investigate whether Flood-LLM can reproduce high-resolution planning-oriented flood risk labels that are consistent with expert-reviewed classifications; and (4) we conduct a systematic empirical evaluation using real-world datasets from a salient case, urban Brisbane, including ablation studies, neighborhood parameter analysis, and comparative benchmarking to assess the behavior and limitations of the approach.

The remainder of this paper is organized as follows. Section 2 introduces the study area, data, task formulation, and the proposed Flood-LLM framework, together with implementation details. Section 3 presents the experimental results, including comparisons with classical ML, DL, and LLM baselines, quantitative and visual evaluations against the Council flood maps, and additional analyses such as an ablation study, parameter sensitivity, and illustrative reasoning cases. Section 4 discusses the implications, opportunities, and limitations of LLM-based flood risk assessment. Section 5 concludes the paper and outlines future research directions.

2. Materials and Methods

This section describes the study area, data sources and preprocessing steps, the formulation of the flood risk estimation task, and the proposed Flood-LLM framework, followed by the experimental design.

2.1. Study Area

The empirical study focuses on the Brisbane metropolitan area, a major coastal city in Queensland, Australia, that is highly exposed to multiple flood hazards. Brisbane is affected by riverine flooding along the Brisbane River and its tributaries, local creek flooding in smaller catchments, storm tide flooding in low-lying coastal areas, and intense rainfall generating overland-flow flooding across the urban fabric. The city has experienced several major flood events in recent decades, including those in 1974, 2011, and 2022, which have led to recurrent updates of flood awareness policies and maps by Brisbane City Council (BCC). This context provides a rich and policy-relevant testbed for exploring property-level flood risk estimation.

2.2. Data Sources and Preprocessing

This study integrates multiple geospatial datasets to derive property-level flood risk indicators and corresponding supervision signals for Brisbane.

In general, property parcels, topography, hydrological systems, drainage infrastructure, and historical flood extents are combined into structured property-level feature representations used as model inputs. Property parcels serve as the fundamental spatial unit for analysis, with neighborhoods defined through parcel intersections that provide a spatial scaffold for aggregating contextual information. Within this framework, hydrological and topographic data approximate upstream–downstream influences, drainage infrastructure captures functional stormwater connectivity, and historical flood extents encode empirical exposure patterns, enabling neighborhood interactions to reflect not only geometric proximity but also hydrologically informed relationships (see Figure 1).

In addition, flood risk labels are derived from the BCC’s Flood Awareness Maps, which serve as the official reference for planning and flood risk communication and are used in this study to construct property-level flood risk representations as model targets. These maps synthesize Council-endorsed flood studies and modeling outputs to delineate the extents and likelihoods of four flood types, including creek, river, storm tide, and overland flow flooding, with risk layers derived from approved hydrodynamic and overland flow models. Property parcels serve as the fundamental spatial units to which flood risk labels are assigned. Each parcel is assigned a four-component flood risk label, which provides the ground truth for model training and performance evaluation.

2.2.1. Property Parcels and Elevation

The primary spatial dataset is the BCC Property Parcel dataset, provided in Shapefile vector format. The dataset represents individual land parcels as polygon geometries, where each polygon is defined by an ordered set of planar coordinates forming a closed shape that delineates the legally defined boundary of a property. Each polygon is linked to an attribute table containing descriptive information for the corresponding parcel, including cadastral identifiers, address details, land tenure, and lot area. The shapefile is referenced to a local projected grid system, the Map Grid of Australia Zone 56 (MGA56), based on the Geocentric Datum of Australia 1994 (GDA94) (All coordinates used in this study follow this coordinate system), enabling accurate spatial measurement and integration with other spatial layers. In total, the dataset comprises 519,009 property parcels, providing a city-wide, parcel-level representation of Brisbane’s land subdivision structure (See Figure 2).

Elevation information in this study was obtained from the Brisbane Contour Map dataset accessed via the Queensland Spatial Catalogue. The source data are provided as vector contour lines, where each polyline represents a line of constant elevation above sea level and carries an elevation attribute. To enable property-level elevation attribution, the contour data were processed in ArcGIS Pro 3.4 to generate a raster elevation surface, in which elevation is represented as a grid of regularly spaced cells storing continuous height values. This rasterization step allows elevation values to be queried at specific point locations. The resulting raster dataset is referenced to the same coordinate framework as the property parcel data, ensuring spatial consistency for subsequent analysis (see Figure 3a). The raster value at each property's centroid was then used as that property's assigned elevation.

In practice, property centroid coordinates are used as the input for all models, including Flood-LLM and all baselines. To clarify the notation used throughout this paper, we let $e_{i}$ represent a single property and define the complete set of properties under consideration as

$$E := \{e_{1}, e_{2}, \dots, e_{n}\}, \tag{1}$$

where $n \in \mathbb{N}^{+}$ denotes the total number of properties and $i \in [1, n] \cap \mathbb{N}^{+}$ indexes the $i$-th property. For each property $e_{i}$, the centroid coordinates $X_{i}^{co}$ are used as model inputs:

$$X_{i}^{co} := [\mathrm{co\_e}_{i}, \mathrm{co\_n}_{i}, \mathrm{co\_h}_{i}], \tag{2}$$

where $\mathrm{co\_e}_{i}$ is the easting coordinate of $e_{i}$, $\mathrm{co\_n}_{i}$ is the northing coordinate of $e_{i}$, and $\mathrm{co\_h}_{i}$ is the ellipsoidal height of $e_{i}$.

Specifically, while coordinate values can be used directly by classical ML and DL models, LLM-based approaches require the numerical inputs to be converted into text. In practice, this conversion is performed with Python 3.10's built-in str() function (similar string-conversion functions are widely available in programming languages and are commonly used to transform numerical or structured data into text; the specific implementation details are simplified here as they do not affect the discussion).
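As a minimal sketch of this conversion, the snippet below turns a centroid vector into a short text fragment using only str(). The function name `centroid_to_text`, the sentence template, and the sample coordinates are illustrative assumptions and are not taken from the Flood-LLM implementation.

```python
def centroid_to_text(easting: float, northing: float, height: float) -> str:
    """Render a centroid vector [co_e, co_n, co_h] as plain text for a prompt."""
    # str() is the conversion named in the text; the template is an assumption.
    return (
        "Easting: " + str(easting) + "; "
        "Northing: " + str(northing) + "; "
        "Height: " + str(height)
    )


# Hypothetical centroid values, for illustration only.
x_co = [502345.5, 6961234.25, 12.5]
print(centroid_to_text(*x_co))
```

Classical ML and DL baselines would consume `x_co` directly as a numeric vector, whereas an LLM would receive the returned string as part of its prompt.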

2.2.2. Hydrological and Drainage Infrastructure Data

To characterize Brisbane’s surface and subsurface drainage systems, several additional datasets, sourced from the BCC Open Data Portal in Shapefile format, are incorporated, all referenced to the same projected coordinate system for seamless overlay and network analysis. The waterway network dataset comprises 18,227 polyline features, where each polyline is defined by an ordered sequence of coordinates representing the centreline of a natural or semi-natural watercourse (See Figure 3b). These linear geometries capture the spatial paths of creeks, streams, and drainage channels with either intermittent or permanent surface flow.

The gully dataset contains 167,970 point features that indicate the locations of stormwater inlets where surface runoff enters the drainage system, each point defined by planar coordinates. The drainage pipe dataset consists of 280,229 polyline features representing the spatial alignment of underground stormwater pipes, with each polyline tracing the path of a subsurface conduit (See Figure 4).

The unstructured geometries and heterogeneous formats of waterways, gullies, and drainage pipes make them difficult to encode using standard feature representations. To preserve this information for subsequent LLM queries, we store it using a Polygon–Text data structure. Let $S$ denote the complete set of hydrological features, including waterways, gullies, and pipes, and let $s_{j} \in S$ denote an individual entity. Within a geopandas framework, each such entity is represented as a polygon, defined by a sequence of points outlining its boundary. A specific polygon can be mapped to an ordered sequence of coordinates, where each point is represented by a 3-dimensional coordinate vector. Thus, for entity $s_{j}$, we define the following:

$$\mathrm{Polygon}_{j} \in (\mathbb{R}^{3})^{*}, \tag{3}$$

where $(\mathbb{R}^{3})^{*}$ denotes the set of all finite sequences over $\mathbb{R}^{3}$ (the Kleene closure), that is, sequences of 3-dimensional coordinate vectors of finite but unbounded length. This allows the representation of polygons with any number of vertices. Additional metadata for entity $s_{j}$ is stored as follows:

$$\mathrm{Info}_{j} \in T^{*}. \tag{4}$$

Here, $T$ denotes the vocabulary of text tokens and $T^{*}$ denotes the set of finite sequences of text tokens. The content of $\mathrm{Info}_{j}$ varies with the corresponding $s_{j}$. For a waterway, $\mathrm{Info}_{j}$ may include fields such as name and water type, whereas for a pipe, it may contain width, material, and construction date. The length of $\mathrm{Info}_{j}$ is variable, depending on the richness of the data published by the BCC.

The variable length of both $\mathrm{Polygon}_{j}$ and $\mathrm{Info}_{j}$ renders them incompatible with conventional machine learning or deep learning models, which typically require fixed-dimension vector inputs. For LLM utilization, we design an agentic framework comprising an LLM and specialized tools (detailed in Section 2.4.1) to dynamically retrieve and incorporate relevant information into the model's input context.
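The Polygon–Text pairing can be pictured as a record holding a variable-length vertex list alongside free text. The sketch below is one possible in-memory form and does not reproduce the authors' geopandas-based storage. The class name `HydroEntity`, its field names, the `to_prompt` method, and the attribute values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class HydroEntity:
    """One entity s_j: variable-length geometry plus free-text metadata."""

    kind: str               # "waterway", "gully", or "pipe"
    polygon: List[Point3D]  # Polygon_j: any number of 3-D vertices
    info: str               # Info_j: text of any length

    def to_prompt(self) -> str:
        """Serialize the record to text so it can be placed in an LLM context."""
        coords = "; ".join(str(p) for p in self.polygon)
        return f"{self.kind} | vertices: {coords} | info: {self.info}"


# Hypothetical pipe segment with made-up attribute values.
pipe = HydroEntity(
    kind="pipe",
    polygon=[(502300.0, 6961200.0, 10.0), (502350.0, 6961210.0, 9.5)],
    info="width: 450 mm; material: concrete",
)
print(pipe.to_prompt())
```

Because neither field has a fixed length, such a record cannot be fed to a fixed-dimension model without lossy encoding, but it serializes to text without any schema change.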

2.2.3. Historical Flood Records

Historical flood information was also obtained from BCC’s open data portal in shapefile format, which provides mapped flood inundation extent datasets documenting the observed spatial footprint of flooding for three major events in 1974, 2011, and 2022 (See Figure 5). Each flood event layer consists of a set of polygon features whose geometries delineate the mapped boundaries of areas that were underwater during that event. The 1974 flood record is represented by a single polygon capturing the overall inundation extent, whereas the 2011 and 2022 datasets comprise 802 and 776 polygons, respectively, reflecting a more spatially detailed delineation of inundated areas.

In the implementation, the historical flood record of each property is represented as a binary vector, defined as follows:

$$X_{i}^{re} := [\mathrm{rec\_1974}_{i}, \mathrm{rec\_2011}_{i}, \mathrm{rec\_2022}_{i}], \quad \mathrm{rec\_}year_{i} := \begin{cases} 1, & \text{if property } e_{i} \text{ was flooded in that year;} \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$
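The indicator vector can be sketched with a plain point-in-polygon test. This is a simplified illustration, not the authors' procedure: it assumes each flood extent is a simple ring without holes and tests only the property centroid, whereas the study may intersect full parcel polygons. The function names and the toy geometry are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, ring: Sequence[Point]) -> bool:
    """Ray-casting test: True if pt lies inside the closed ring."""
    x, y = pt
    inside = False
    n = len(ring)
    for k in range(n):
        (x1, y1), (x2, y2) = ring[k], ring[(k + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def flood_record(centroid: Point, extents: Dict[int, List[Sequence[Point]]]) -> List[int]:
    """Return [rec_1974, rec_2011, rec_2022] as 0/1 indicators."""
    return [
        int(any(point_in_polygon(centroid, ring) for ring in extents[year]))
        for year in (1974, 2011, 2022)
    ]


# Toy extents: the same square was inundated in 1974 and 2022, nothing in 2011.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
extents = {1974: [square], 2011: [], 2022: [square]}
print(flood_record((5.0, 5.0), extents))  # [1, 0, 1]
```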

2.2.4. Flood Risk Labels

Flood risk labels are derived from BCC’s Flood Awareness Maps, which are planning-oriented classification products that integrate modeling outputs, historical studies, and regulatory interpretation into discrete exposure zones intended for public communication and development assessment.

Specifically, the maps delineate exposure zones for four flood types: creek, river, storm tide, and overland flow. For each type, the Flood Awareness Maps are provided as a separate Shapefile-based vector dataset, in which a collection of polygon features represents flood exposure. Each polygon delineates a geographic zone of uniform flood risk for a given flood type and is accompanied by attribute information specifying the flood type and its associated risk level. In terms of data volume, the creek flood dataset contains 45,546 polygons, the river flood dataset 5308 polygons, the storm tide dataset 34,808 polygons, and the overland-flow dataset 1,246,568 polygons, indicating substantial variation in the spatial granularity with which different flood processes are mapped.

Across all flood types, BCC derives risk levels from flood study modeling outputs associated with annual exceedance probability (AEP) scenarios, which are then translated through expert interpretation into discrete planning categories of likelihood or impact. Specifically, for creek, river, and storm tide floods, risk levels are expressed as AEP categories: high (5%), medium (1%), low (0.2%), and very low (0.05%). For overland-flow flooding, risk is defined in terms of impact: high (5%), medium (2%), and low (1%).
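To give a feel for what these AEP categories imply over a planning horizon, the sketch below computes the chance of at least one exceedance in a given number of years. It relies on the standard simplifying assumption that years are independent and is not part of the BCC methodology or of Flood-LLM. The function name and the 30-year horizon are illustrative.

```python
def prob_at_least_one(aep: float, years: int) -> float:
    """Chance of one or more exceedances in `years`, assuming independent years."""
    return 1.0 - (1.0 - aep) ** years


# AEP values for the creek, river, and storm tide categories quoted above.
for name, aep in [("high", 0.05), ("medium", 0.01), ("low", 0.002)]:
    print(name, round(prob_at_least_one(aep, 30), 3))
```

Under this assumption, a 1% AEP corresponds to roughly a one-in-four chance of at least one exceedance over 30 years.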

In this study, the “low” and “very low” categories are merged for consistency across flood types. This merging reflects how these strata are operationalized in practice. Although the Flood Awareness Maps distinguish between “low” and “very low”, both categories are consolidated into a single 0.2% AEP designation in the Council’s “FloodWise Property Report”, which is the primary parcel-level reporting instrument supporting development assessment in Brisbane. Similarly, the Council’s public guidance document “Flooding in Brisbane: A Guide for Residents” [28] issues identical recommendations for residents in the two strata, indicating functional equivalence for household preparedness and residential decision-making. Therefore, merging the two categories aligns the experimental taxonomy with their real-world usage and preserves the ordinal structure of flood severity (high > medium > low) across all flood types.

This practical alignment, however, reduces label granularity and may diminish the model’s sensitivity to marginal flood risk, particularly where subtle variations occur at the urban fringe or in redeveloping areas, potentially limiting its ability to capture such transitional patterns.

In addition, it is important to recognize that the flood risk labels carry inherent uncertainties and limitations because the underlying map layers were developed from separate modeling studies conducted at different times and are neither synchronously nor continuously updated. For example, some statistical likelihood layers originate from earlier studies, such as the official creek and storm tide layers from 2012 and 2013 and the overland flow layer from 2017, while historical flood extents, including those associated with the 2022 Brisbane River crest, are based on event-specific observations. Rather than being refreshed through regular physical modeling cycles, the map is updated on a project basis when new flood study results become available, as illustrated by the 2025 revision, which updated the creek flood mapping for over 17,000 properties. These variations in data sources and update cycles may influence Flood-LLM by introducing inconsistencies that become embedded in the learned pattern, and therefore should be considered when interpreting the model’s behavior.

All risk levels are encoded numerically in this study as 0 (no exposure), 1 (low), 2 (medium), and 3 (high), following the rule defined by BCC [29,30,31,32]. To assign these flood risk labels to individual properties, a spatial join was conducted in ArcGIS (a spatial join is a location-based matching process in which attributes from one spatial layer are transferred to another based on their geometric relationship). As all spatial layers share the same coordinate system, property parcel polygons and flood risk polygons can be directly overlaid without coordinate transformation.

The procedure was applied separately to each flood type by spatially joining the property parcel layer with the corresponding flood risk layer. For each flood type, a property was assigned the highest intersecting risk level if it overlapped with one or more flood risk polygons, and 0 (no exposure) if no intersection occurred. Repeating this process across the four flood types yields a four-dimensional flood risk vector for each property, with one risk level per flood type. The distribution of properties across risk levels for each flood type is summarized in Figure 6.
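The assignment rule described above (highest intersecting level per flood type, 0 when nothing intersects, "very low" merged into "low") can be sketched as follows. The geometric intersection itself is assumed to have been done already, as in the ArcGIS spatial join. The names `FLOOD_TYPES`, `RISK_CODE`, and `risk_vector`, and the input format, are hypothetical.

```python
from typing import Dict, List

FLOOD_TYPES = ("creek", "river", "storm_tide", "overland")
# "very low" is merged into "low", as described for this study.
RISK_CODE = {"high": 3, "medium": 2, "low": 1, "very low": 1}


def risk_vector(hits: Dict[str, List[str]]) -> List[int]:
    """Map each flood type to the highest intersecting level, or 0 if none.

    `hits` maps a flood type to the risk levels of all polygons that
    intersect the parcel.
    """
    return [
        max((RISK_CODE[level] for level in hits.get(ftype, [])), default=0)
        for ftype in FLOOD_TYPES
    ]


hits = {"creek": ["low", "high"], "river": ["very low"], "overland": ["medium"]}
print(risk_vector(hits))  # [3, 1, 0, 2]
```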

For each property $e_{i}$, the corresponding 4 labels can be denoted by the following:

$$\begin{aligned} y_{i} &:= [\mathrm{label}_{i}^{Creek}, \mathrm{label}_{i}^{River}, \mathrm{label}_{i}^{Storm}, \mathrm{label}_{i}^{Overland}], \\ \mathrm{label}_{i}^{Creek} &\in \{0, 1, 2, 3\}: \text{creek flooding level of property } e_{i}, \\ \mathrm{label}_{i}^{River} &\in \{0, 1, 2, 3\}: \text{river flooding level of property } e_{i}, \\ \mathrm{label}_{i}^{Storm} &\in \{0, 1, 2, 3\}: \text{storm tide flooding level of property } e_{i}, \\ \mathrm{label}_{i}^{Overland} &\in \{0, 1, 2, 3\}: \text{overland flow level of property } e_{i}. \end{aligned} \tag{6}$$

Then, the overall dataset can be defined as follows:

$$Y := \{(e_{i}, y_{i}) \mid e_{i} \in E\}. \tag{7}$$

To ensure a strict separation between model training and evaluation, we adopt a spatial hold-out strategy [33] when using the flood risk labels. Brisbane comprises 190 suburbs, among which 38 suburbs are randomly selected as the training region, while the remaining 152 suburbs are reserved exclusively for testing.

Let $\mathrm{Sub}_{train}$ denote the set of training suburbs and $\mathrm{Sub}_{test}$ denote the set of testing suburbs, such that

$$\mathrm{Sub}_{train} \cap \mathrm{Sub}_{test} = \emptyset. \tag{8}$$

Each property $e_{i} \in E$ belongs to a suburb $s_{i}$, and the training and testing sets of properties are defined as

$$E_{train} = \{e_{i} \in E \mid s_{i} \in \mathrm{Sub}_{train}\}, \quad E_{test} = \{e_{i} \in E \mid s_{i} \in \mathrm{Sub}_{test}\}. \tag{9}$$

Using expert-provided flood risk labels $y_{i}$, the corresponding datasets are

$$Y_{train} = \{(e_{i}, y_{i}) \mid e_{i} \in E_{train}\}, \quad Y_{test} = \{(e_{i}, y_{i}) \mid e_{i} \in E_{test}\}. \tag{10}$$

Under this partition, among the 519,009 properties in Brisbane, 103,757 properties are assigned to the training set and 415,252 properties to the test set. Properties located in the training suburbs provide labeled examples for model learning, whereas the flood risk labels in the test suburbs are retained exclusively as ground truth for evaluation.
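A suburb-level hold-out of this kind can be sketched in a few lines. The example uses toy data with the same 20% share of training suburbs as the 38-of-190 split. The function name `spatial_holdout`, the seed, and the identifiers are assumptions, and the snippet does not reproduce the actual suburb selection used in the study.

```python
import random
from typing import Dict, List, Set, Tuple


def spatial_holdout(
    suburb_of: Dict[str, str], n_train_suburbs: int, seed: int = 0
) -> Tuple[Set[str], List[str], List[str]]:
    """Split property IDs by suburb so that no suburb spans both sets."""
    suburbs = sorted(set(suburb_of.values()))
    train_suburbs = set(random.Random(seed).sample(suburbs, n_train_suburbs))
    train = [p for p, s in suburb_of.items() if s in train_suburbs]
    test = [p for p, s in suburb_of.items() if s not in train_suburbs]
    return train_suburbs, train, test


# Toy data: 10 hypothetical suburbs with 3 properties each.
suburb_of = {f"p{s}_{k}": f"suburb_{s}" for s in range(10) for k in range(3)}
train_suburbs, train, test = spatial_holdout(suburb_of, n_train_suburbs=2)
print(len(train_suburbs), len(train), len(test))  # 2 6 24
```

Sampling suburbs rather than individual properties keeps spatially adjacent parcels on the same side of the split, which is what prevents neighbors of a test property from leaking into training.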

This design simulates realistic operational conditions for parcel-level flood risk mapping, where the physical environment of the entire city (e.g., terrain, drainage systems, and waterway geometry) is observable, but expert-reviewed flood risk labels are available only for a limited subset of locations. Under such conditions, the model can serve as a fast pre-estimation tool to provide preliminary risk assessments before comprehensive expert evaluation becomes available.

2.3. Task Formulation

Flood risk maps are widely used as planning instruments for long-term disaster risk management and sustainable urban development. They are typically developed through workflows that combine hydrodynamic modeling with expert interpretation and local knowledge to assign flood risk classifications to individual properties. Updating these maps in response to changes in properties, infrastructure, or environmental conditions can be time-intensive, as the process often involves repeated modeling, data integration, and expert assessment. As a result, the production and updating of such maps may constrain the timely availability of planning-oriented risk information needed to support adaptive and sustainable planning decisions.

To explore the feasibility of emerging AI-based approaches in flood risk analysis, this study examines whether an LLM-based framework can learn the label assignment patterns embedded in expert-reviewed flood risk maps from heterogeneous urban and environmental indicators. It is presented purely as an exploration of new methodological possibilities that may, in the future, offer potential as preliminary or pre-screening support tools for more timely and sustainable flood risk management, subject to careful oversight, accountability, and ethical considerations. Here, we model expert decision-making as an unknown mapping function, denoted as f, that satisfies for any

e_{i} \in E

:

f (e_{i}) = y_{i},

(11)

which assigns each property a vector of flood risk levels across the considered flood types. In practice, the function f is governed by expert knowledge and heuristics, and its explicit form is typically unknown. Only a limited set of property–risk level pairs provided by experts following this function is available.

To approximate this unknown mapping using machine learning techniques, we seek to learn a function

g_{θ}

parameterized by

θ

:

g_{θ} (e_{i}) = y_{i}^{'},

(12)

where $y_{i}^{'}$ denotes the flood levels predicted by $g_{θ}$ for $e_{i}$.

Suppose there exists a

\hat{θ}

such that the model

g_{θ}

most closely approximates the expert decision-making function f. Given the training data

Y_{train}

used in the learning process, the estimation of

\hat{θ}

can be formulated as the following optimization problem:

\hat{θ} = arg min_{θ} \sum_{e_{i} \in Y_{train}} ∥f (e_{i}) - g_{θ} (e_{i})∥,

(13)

where

∥ \cdot ∥

denotes the norm of the input. We can then obtain the output of the learned model

g_{\hat{θ}}

on the test set

Y_{test}

, which denotes the properties for which expert-labeled flood levels are not available:

{\hat{y}}_{j} = g_{\hat{θ}} (e_{j}) .

(14)

Using the predicted flood levels

g_{\hat{θ}} (e_{j})

enables efficient estimation of expert-assessed flood risk levels for each property

e_{j} \in Y_{test}

, thereby facilitating the rapid generation of preliminary flood risk maps. Thus, we formulate the flood risk assessment task as the optimization of

\hat{θ}

. Section 2.4 presents the technical implementation used to solve this optimization (Equation (37)).

2.4. LLM Approach for Flood Risk Assessment

Considering the capacity of large language models to integrate heterogeneous information through textual reasoning, we develop an LLM-based framework for flood risk assessment, termed Flood-LLM. The framework comprises three key components. First, a Relevant Info Agent constructs a structured context of property-specific information from heterogeneous data sources. Second, a Div-max Neighbor Info Agent incorporates contextual information from surrounding properties to represent relevant neighborhood conditions. Finally, a Learnable Estimation Agent employs a training strategy that enables the model to learn expert-informed labeling patterns and provides a textual reasoning trace that supports interpretability of the estimation process.

2.4.1. Relevant Info Agent

Advances in information technology have enabled the widespread collection of digital representations of real-world properties by various organizations. These geospatial, hydrological, and infrastructural datasets provide a critical foundation for flood risk estimation. However, conventional information methods often struggle to supply downstream decision-making models with sufficient domain-specific context within a limited input length.

To address this challenge, we develop an LLM agent (see Figure 7) that leverages specialized geometric operations from the open-source geometry processing library geopandas [34]. This agent retrieves and summarizes property-relevant information to augment the prompt context for downstream flood risk estimation. The system is designed to be modular, such that adding or removing specific data sources does not disrupt the overall workflow.

With the Polygon-Text data saved as shown in Section 2.2.2, for a property $e_{i} \in E$, the polygon processing functions in geopandas, including GeoSeries.intersects, GeoSeries.distance, and other necessary functions, are used to compute the perpendicular distance between the property and each hydrological or drainage infrastructure element $s_{j}$ in $S$. To simplify the discussion, we denote this operation by $dis (e_{i}, s_{j})$, which returns the perpendicular distance between a property $e_{i}$ and the ${Polygon}_{j}$ of $s_{j}$. Let $η$ be a threshold that controls the search distance. We then define a retrieval function $ret : E \to S_{j}$ accordingly:

ret (e_{i}) = \{{Info}_{j} \mid dis (e_{i}, s_{j}) \leq η\},

(15)

where ${Info}_{j}$ is the unstructured information of entity $s_{j}$ as defined in Equation (4).

Let ⊕ denote the text concatenation operator. Accordingly, for a property

e_{i}

, the retrieved information is given by the following:

D_{i} : = \oplus_{{Info}_{j} \in ret (e_{i})} {Info}_{j} .

(16)

The context generated by the Relevant Info Agent for property $e_{i}$ is constructed by concatenating its coordinates, nearby hydrological and drainage infrastructure information, and its historical flood records (in practice, a fixed LLM is used to pre-process $Rela (e_{i})$ when constructing the context for the downstream flood risk model. “Fixed” indicates that we use an open-source LLM directly, without modifying its learned parameters. As the detailed architecture is relatively complex and not essential to the present discussion, we omit those details here; a more in-depth analysis is provided in Section 2.4.3. In our experiments, we adopt the base, unmodified versions of each model: the original LLaMA for Flood-LLaMA and the original Qwen for the Relevant Info Agent):

Rela (e_{i}) : = X_{i}^{co} \oplus D_{i} \oplus X_{i}^{re},

(17)

where $X_{i}^{co}$ is the coordinate information of $e_{i}$ defined in Equation (2), $D_{i}$ is the concatenated hydrological or drainage infrastructure information near $e_{i}$ (Equation (16)), and $X_{i}^{re}$ is the historical flood record information of $e_{i}$ defined in Equation (5).

The

Rela (e_{i})

is fed into the downstream LLM for neighbor-property information construction, providing contextual information about neighboring properties.
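The retrieval and concatenation steps in Equations (15)–(17) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the paper computes distances on polygons with geopandas (GeoSeries.distance), whereas this sketch substitutes a plain Euclidean distance between hypothetical representative points. The names `Element`, `retrieve`, and `build_rela`, the sample records, and the newline-joined text format are all illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Element:
    """A hydrological or drainage element s_j (hypothetical structure)."""
    info: str   # unstructured text Info_j
    x: float    # representative point in metres (assumption)
    y: float


def dis(prop_xy, element):
    """Stand-in for dis(e_i, s_j); the paper uses polygon distances."""
    return math.hypot(prop_xy[0] - element.x, prop_xy[1] - element.y)


def retrieve(prop_xy, elements, eta=100.0):
    """Equation (15): keep Info_j for every element within eta metres."""
    return [s.info for s in elements if dis(prop_xy, s) <= eta]


def build_rela(coord_text, prop_xy, elements, record_text, eta=100.0):
    """Equations (16)-(17): join X_co, D_i and X_re into one text context."""
    d_i = "\n".join(retrieve(prop_xy, elements, eta))
    return "\n".join(part for part in (coord_text, d_i, record_text) if part)


# Hypothetical sample data: one element 50 m away, one 300 m away.
elements = [
    Element("Stormwater pipe, 450 mm", 30.0, 40.0),
    Element("Creek centreline", 300.0, 0.0),
]
rela = build_rela("Property at (0, 0)", (0.0, 0.0), elements,
                  "No recorded flood events")
print(rela)
```

With the 100 m threshold, only the stormwater pipe is retained, so the distant creek record does not consume prompt length.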

2.4.2. Div-Max Neighbor Info Agent

Our framework requires each property representation to encode not only its own attributes, but also the conditions of nearby parcels that shape its flood exposure. To meet this requirement, we introduce the Div-max Neighbor Info Agent, which collects geohydrological, drainage, and historical attributes from parcels surrounding a target property (see Figure 8). The agent operates through a graph-style recursive context aggregation mechanism: each property node integrates information from its adjacent neighbors and progressively incorporates contributions from higher-order neighbors. This process yields a neighborhood representation that captures broader spatial patterns relevant to inundation risk.

Specifically, for a property

e_{i}

, we define a neighborhood retrieval function

f_{neighbor} : E \to E^{*}

, which identifies adjacent properties using the GeoSeries.intersects function from the geopandas package. The resulting neighborhood set

N_{i} \subset E

is given by the following:

N_{i} : = f_{neighbor} (e_{i}) .

(18)

Although only adjacent neighbors of a specific property are included in its neighborhood set from a local perspective, a global perspective reveals that recursive multi-hop neighborhood connections can model long-range potential relationships between any two properties through this adjacency relation (i.e., any property in the city can be a multi-hop neighbor of a given property). This constructs a large-scale graph where nodes represent properties, and edges consist of both adjacency-based neighborhood relationships and paths along these neighborhood relations that connect any two properties within the city.

We term this urban simulation approach “graph-style recursive aggregation.” Beyond this neighborhood relation, no complex relationships were simulated. This is because the model may be applied in pre-expert scenarios, and constructing hydrologically meaningful relationships without bias or noise would be time-consuming to implement in such scenarios.

Our approach here is to provide graph-style recursive aggregation, which globally incorporates all possible paths between any two properties, and to leverage data to train the model to learn inherent patterns (e.g., upstream-downstream relationships and their roles across different flood types).

We then apply the context construction process defined in Equation (17) to generate contextual descriptions for each neighboring property. However, in dense urban environments, the resulting context can become excessively long, potentially distracting the downstream model and substantially increasing token consumption.

To mitigate this issue while preserving informative content, we introduce an embedding-based filtering strategy.

We apply a variant of the LLM, a parameter-fixed LLM embedding model, expressed as

M^{'} (input_text) : = NN ({tokenizer}_{in} (input_text)),

to generate vector representations of each neighbor’s context (as the detailed architecture is relatively complex and not essential to the present discussion, we omit the specifics of this modification here; a more in-depth introduction is provided in Section 2.4.3. In effect, $M^{'}$ is an LLM without the “tokenizer decoder” component and with its learnable parameters $W$, as defined in Equation (28) of Section 2.4.3, held fixed).

Let

d_{emb} \in N^{+}

denote the predefined embedding dimension. The parameter-fixed LLM embedding model is defined as a function that takes text—composed of token sequences—as input and outputs numeric vectorized representations of the context, i.e., the embeddings:

M^{'} : T^{*} \to R^{d_{emb}} .

(19)

Such vector representation is usually known as the “embeddings” of the input neighbor’s context in the LLM community.

For a neighbor

e_{j} \in N_{i}

, its embedding

H_{j} \in R^{d_{emb}}

is computed as follows:

H_{j} : = M^{'} (Rela (e_{j})) .

(20)

These LLM-generated embeddings have a useful property: the Euclidean distance between embeddings of semantically similar texts tends to be smaller than that between embeddings of semantically dissimilar texts. We leverage this property to reduce the context complexity of the target property while retaining contextual information from neighbors with distinct characteristics. This is achieved by merging neighbors with similar contextual features.

In this process, the first step is to identify a representational variance set

{\hat{N}}_{i} \subseteq N_{i}

. This

{\hat{N}}_{i}

consists of neighbors whose embeddings exhibit maximal variance from one another, meaning they cannot be easily characterized as “similar to another neighbor already included in the set

{\hat{N}}_{i}

”:

\begin{aligned} {\hat{N}}_{i} &= \arg\max_{{\tilde{N}}_{i} \subseteq N_{i}} δ {({\tilde{N}}_{i})}^{2}, \\ δ ({\tilde{N}}_{i}) &:= \mathbb{E}_{e_{j^{'}} \in {\tilde{N}}_{i}} {∥ H_{j^{'}} - \mathbb{E}_{e_{j} \in {\tilde{N}}_{i}} H_{j} ∥}^{2}, \\ \text{s.t.} \quad & | {\tilde{N}}_{i} | \leq k . \end{aligned}

(21)

For the remaining neighbors

e_{j} \in N_{i} ∖ {\hat{N}}_{i}

, we use embedding similarity, measured via Euclidean distance, to determine which neighbor in

{\hat{N}}_{i}

they are most similar to:

f_{sim} (e_{j}) : = arg min_{e_{j^{'}} \in {\hat{N}}_{i}} {∥ H_{j} - H_{j^{'}} ∥}^{2} .

(22)

We then replace the long description of each such neighbor $e_{j} \in N_{i} ∖ {\hat{N}}_{i}$ with a concise description: “$e_{j}$ is similar to $f_{sim} (e_{j})$.” Thus, we construct the modified context $C_{j}$ for each neighbor $e_{j} \in N_{i}$ as follows:

C_{j} = \begin{cases} Rela (e_{j}), & \text{if } e_{j} \in {\hat{N}}_{i}, \\ \text{“} e_{j} \text{ is similar to } f_{sim} (e_{j}) \text{”}, & \text{if } e_{j} \notin {\hat{N}}_{i} . \end{cases}

(23)

Finally, we define the aggregated context by the Div-max Neighbor Info Agent for the target property

e_{i}

as follows:

Dmax (e_{i}) = ⨁_{e_{j} \in N_{i}} C_{j} .

(24)

This context

Dmax (e_{i})

serves as a compressed yet information-rich representation of the neighborhood of the target property

e_{i}

, and enables the downstream model to relate

e_{i}

to any other property in the city via graph-style recursive neighbor aggregation.
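The selection and merging steps in Equations (21)–(24) can be sketched in plain Python as below. This is an illustrative sketch, not the authors' code: the function names, the toy two-dimensional embeddings, and the newline-joined output format are assumptions, and real embeddings would come from the parameter-fixed embedding model $M^{'}$. The subset search is exhaustive, and it maximizes $δ$ directly, which selects the same subset as maximizing $δ^{2}$ because $δ$ is non-negative.

```python
from itertools import combinations


def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def variance(vectors):
    """delta in Equation (21): mean squared distance to the centroid."""
    n = len(vectors)
    centroid = [sum(col) / n for col in zip(*vectors)]
    return sum(sq_dist(v, centroid) for v in vectors) / n


def select_diverse(embeddings, k):
    """Search all non-empty subsets of size <= k for maximal variance.

    Assumes `embeddings` is a non-empty dict {neighbor_id: vector}.
    Ties are resolved in favour of the first subset encountered.
    """
    ids = list(embeddings)
    best, best_score = (ids[0],), -1.0
    for size in range(1, min(k, len(ids)) + 1):
        for subset in combinations(ids, size):
            score = variance([embeddings[i] for i in subset])
            if score > best_score:
                best, best_score = subset, score
    return set(best)


def build_dmax(embeddings, contexts, k):
    """Equations (22)-(24): keep selected contexts, summarise the rest."""
    selected = select_diverse(embeddings, k)
    parts = []
    for j in embeddings:
        if j in selected:
            parts.append(contexts[j])
        else:
            nearest = min(selected,
                          key=lambda s: sq_dist(embeddings[j], embeddings[s]))
            parts.append(f"{j} is similar to {nearest}")
    return "\n".join(parts)


# Hypothetical toy neighborhood: e1 and e2 are near-duplicates, e3 differs.
embeddings = {"e1": [0.0, 0.0], "e2": [0.1, 0.0], "e3": [10.0, 0.0]}
contexts = {"e1": "context of e1", "e2": "context of e2", "e3": "context of e3"}
print(build_dmax(embeddings, contexts, k=2))
```

In this toy case the two most dissimilar neighbors keep their full descriptions, while the near-duplicate is reduced to a one-line pointer.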

2.4.3. Learnable Estimation Agent

Unlike traditional machine learning methods, large language models (LLMs) excel at integrating information from heterogeneous and complex domains. Although they differ in architectural details, most LLMs consist of two main components: fixed tokenization functions that map input text to vector representations and convert vectors back into text (i.e., the tokenizer encoder and decoder), and a neural network with learnable parameters. During inference, input text is represented and processed by the LLM as a sequence of vectors (see Figure 9).

Let

d_{in} \in N^{+}

and

d_{out} \in N^{+}

denote the predefined input and output vector dimensions of the LLM, respectively. The tokenizer encoder is defined as

{tokenizer}_{in} : T^{*} \to {(R^{d_{in}})}^{*} .

(25)

Similarly, the tokenizer decoder is defined as

{tokenizer}_{out} : {(R^{d_{out}})}^{*} \to T^{*} .

(26)

The tokenizer encoder and decoder functions implement fixed mappings: the encoder converts input text into real-valued vectors that can be processed by a downstream neural network, and the decoder converts the resulting vector representations back into text (there are many different tokenizer designs, and each open-source LLM is released with its own corresponding tokenizer. As this paper does not address modifications to the tokenizer component, the detailed design of the tokenizer is omitted).

The learning capability of an LLM is determined by its neural network component, whose parameters can be adjusted to approximate desired output patterns and generate appropriate responses. Although the architectural designs of neural networks vary across different LLMs, they can all be expressed as compositions of parameterized functions with distinct learnable parameters. Specifically, these neural networks are constructed by stacking basic functional units, commonly referred to as “layers”.

To simplify the discussion, let $ℓ \in N^{+}$ denote the total number of layers in the neural network, and let $l \in [1, ℓ]$ index a specific layer. We represent the parameter set ${\{W^{l}\}}_{l = 1}^{ℓ}$ by a single matrix $W \in R^{α \times β}$ that aggregates all learnable parameters in the set, where $α \in N^{+}$ and $β \in N^{+}$ denote dimensions large enough for $W$ to contain every $W^{l}$. The neural network can then be written as

{NN}_{W} (\cdot) = {layer}_{W^{ℓ}}^{ℓ} (\dots {layer}_{W^{2}}^{2} ({layer}_{W^{1}}^{1} (\cdot)) \dots) .

(27)

Finally, an LLM with learnable parameters

W

can be expressed as

M_{W} (input_text) = {tokenizer}_{out} ({NN}_{W} ({tokenizer}_{in} (input_text))) .

(28)

Let

p r o m p t_{estimate} \in T^{*}

denote a prompt that guides the LLM to estimate flood risk levels based on the combined property context

Rela (e_{i}) \oplus Dmax (e_{i})

, and return the output in a structured format (e.g., placing the result between the delimiters “$[” and “]$”). The generated textual output for property

e_{i}

is then given by

O_{i} (W) : = M_{W} (p r o m p t_{estimate} \oplus Rela (e_{i}) \oplus Dmax (e_{i})) .

(29)

Although LLMs perform well on general tasks, domain-specific applications such as property-level flood risk estimation continue to require alignment with expert reasoning processes. However, data collected from human experts typically contains only final estimation results, without the intermediate reasoning or chain of thought.

To fine-tune the LLM using only these ground-truth labels, we define a mask function

f_{mask}

that extracts the estimated risk vector from the model output using the format enforced by

p r o m p t_{estimate}

:

f_{mask} (s) = \begin{cases} s^{'}, & \text{if } \exists\, s^{'} \in \{0, 1, 2, 3\}^{4} \text{ between “\$[” and “]\$” in } s; \\ \text{None}, & \text{otherwise} . \end{cases}

(30)

Equation (30) means that only outputs that satisfy the constraints are used, while all others are discarded, which allows the result to be extracted automatically by a program (in practice, the

s^{'}

in Equation (30) is converted from a string to an integer array using standard string-conversion functions that are available in most programming languages, as discussed below Equation (2). This step is simplified here because such processing is common in computer algorithms and does not affect the discussion). Specifically, the constraint in Equation (30) consists of two parts. First, only content between “$[” and “]$” is extracted, ensuring that only outputs following the format enforced by

{prompt}_{estimate}

are considered. Second, extraction occurs only when

s^{'} \in \{0, 1, 2, 3\}^{4}

, which ensures that the output contains valid predicted flood levels consistent with the format of the expert labels

y_{i}

, as defined in Equation (6). Thus, we can extract the four integers from the output of

f_{mask} (O_{i} (W))

:

\begin{aligned} [{pred}_{i}^{Creek}, {pred}_{i}^{River}, {pred}_{i}^{Storm}, {pred}_{i}^{Overland}] &: = f_{mask} (O_{i} (W)), \\ {pred}_{i}^{Creek} &\in \{0, 1, 2, 3\} : \text{LLM-predicted creek flooding level of } e_{i}, \\ {pred}_{i}^{River} &\in \{0, 1, 2, 3\} : \text{LLM-predicted river flooding level of } e_{i}, \\ {pred}_{i}^{Storm} &\in \{0, 1, 2, 3\} : \text{LLM-predicted storm tide flooding level of } e_{i}, \\ {pred}_{i}^{Overland} &\in \{0, 1, 2, 3\} : \text{LLM-predicted overland flow level of } e_{i} . \end{aligned}

(31)
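The extraction described by Equations (30) and (31) can be sketched with a regular expression. The paper specifies the delimiters “$[” and “]$” and the value set {0, 1, 2, 3}; the comma-separated layout inside the delimiters, the function name `f_mask`, and the choice to take the first match are assumptions made for this illustration.

```python
import re

# Assumed layout: four comma-separated digits in {0,1,2,3} between "$[" and "]$".
_ANSWER = re.compile(
    r"\$\[\s*([0-3])\s*,\s*([0-3])\s*,\s*([0-3])\s*,\s*([0-3])\s*\]\$"
)


def f_mask(output_text):
    """Return [creek, river, storm, overland] or None if no valid answer."""
    match = _ANSWER.search(output_text)  # first valid occurrence (assumption)
    if match is None:
        return None
    return [int(group) for group in match.groups()]


print(f_mask("The lot sits beside a creek ... final answer: $[2, 0, 0, 1]$"))
print(f_mask("final answer: $[2, 0, 5, 1]$"))  # 5 is not a valid level
```

Outputs with missing delimiters, out-of-range levels, or the wrong number of entries all map to None, which corresponds to the second branch of Equation (30).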

For each property

e_{i}

, we apply a variant of the norm function in Equation (13) that is augmented to handle None values as scores:

P_{i} (W) = \begin{cases} - ∥ y_{i} - f_{mask} (O_{i} (W)) ∥, & \text{if } f_{mask} (O_{i} (W)) \text{ is not None}; \\ 0, & \text{otherwise} . \end{cases}

(32)
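A minimal sketch of the score in Equation (32) is given below. The paper does not state which norm is used, so the Euclidean norm here is an assumption, as is the function name `score`; the `prediction` argument stands for the value returned by $f_{mask}$, which is either a list of four levels or None.

```python
import math


def score(expert_label, prediction):
    """P_i from Equation (32) for one property.

    `prediction` is the extracted risk vector, or None for invalid output.
    """
    if prediction is None:
        return 0.0  # second branch of Equation (32)
    # Euclidean norm is an assumption; the paper leaves the norm unspecified.
    return -math.dist(expert_label, prediction)


print(score([0, 2, 1, 3], [0, 2, 1, 3]))  # exact match
print(score([0, 0, 0, 0], [3, 0, 0, 0]))  # one flood type off by three levels
print(score([0, 2, 1, 3], None))          # unparseable output
```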

The value of $P_{i}$ reflects how closely the model’s prediction matches the expert label: higher values indicate closer alignment, while lower values indicate greater discrepancy. The corresponding variant of the optimization in Equation (13) for fine-tuning the flood risk estimation LLM $f_{mask} (M_{W} ({prompt}_{estimate} \oplus Rela (e_{i}) \oplus Dmax (e_{i})))$ can then be written as:

\hat{W} = arg max_{W} \sum_{(e_{i}, y_{i}) \in Y_{train}} P_{i} (W) .

(33)

Specifically, all open-source LLMs are pretrained by their providers. Let

\bar{W}

denote the pretrained parameters released by the LLM provider. The optimization problem in (33) can then be reformulated as learning a parameter modification

Δ W \in R^{α \times β}

applied to the pretrained parameters

\bar{W}

:

Δ W = arg max_{Δ W^{'} \in R^{α \times β}} \sum_{(e_{i}, y_{i}) \in Y_{train}} P_{i} (\bar{W} + Δ W^{'}) .

(34)

However, given the large number of learnable parameters,

Δ W

can be prohibitively large. As a result, directly optimizing (34) requires substantial GPU memory and training time (even for the “smaller” 3B-scale LLMs used in this study, directly optimizing (34) may require more than 32 GB of GPU memory). To mitigate this computational burden, we fine-tune the LLM using the Low-Rank Adaptation (LoRA) approach [35]. Under this approach, the update matrix

Δ W

is parameterized as the product of two low-rank matrices,

A \in R^{α \times γ}

and

B \in R^{γ \times β}

, where

γ ≪ min (α, β)

. Combining Equations (32)–(34), the Learnable Estimation Agent utilizing this approach can be denoted by

\hat{A}, \hat{B} = arg max_{A \in R^{α \times γ}, B \in R^{γ \times β}} \sum_{(e_{i}, y_{i}) \in Y_{train}} P_{i} (\bar{W} + A B) .

(35)

In this approach, the dense matrix

Δ W

is approximated by the product of two much smaller matrices,

\hat{A}

and

\hat{B}

, which effectively reduces memory consumption to approximately one quarter of that required by the original approach.

Our Learnable Estimation Agent learns the final learnable parameters using this LoRA-based approach:

Learn (Y_{train}) : = \bar{W} + \hat{A} \hat{B} .

(36)

The output can be directly applied as the LLM parameters, enabling the LLM to predict flood levels by following the patterns learned from the samples in

Y_{train}

.
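The low-rank parameterization in Equations (35) and (36) can be illustrated with a small dependency-free sketch. It compares the number of trainable entries in a dense update $Δ W$ with that of the factors $A$ and $B$, and applies $\bar{W} + A B$ to a toy matrix. The 2048 × 2048 layer size and the helper names are hypothetical; only the rank $γ = 64$ is taken from the experimental settings.

```python
def lora_param_counts(alpha, beta, gamma):
    """Trainable entries: dense update vs. low-rank factors A and B."""
    dense = alpha * beta                      # entries of Delta W
    low_rank = alpha * gamma + gamma * beta   # entries of A plus entries of B
    return dense, low_rank


def apply_lora(w_bar, a, b):
    """Equation (36): return W_bar + A B for nested-list matrices."""
    gamma = len(b)
    return [
        [w_bar[r][c] + sum(a[r][t] * b[t][c] for t in range(gamma))
         for c in range(len(w_bar[0]))]
        for r in range(len(w_bar))
    ]


# Hypothetical 2048 x 2048 weight matrix with rank gamma = 64.
dense, low_rank = lora_param_counts(alpha=2048, beta=2048, gamma=64)
print(dense, low_rank)

# Toy example with gamma = 1.
w_bar = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [2.0]]      # alpha x gamma
b = [[0.5, 0.0]]        # gamma x beta
print(apply_lora(w_bar, a, b))
```

For this hypothetical layer, the factors hold 262,144 trainable entries against 4,194,304 for the dense update.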

2.4.4. Overall Flood-LLM Framework

Combining the components described above, we propose the Flood-LLM framework to estimate the flood risk level

{\hat{y}}_{j}

for any property

e_{j} \in Y_{test}

, i.e., properties for which the flood levels are unknown:

{\hat{y}}_{j} = \text{Flood-LLM} (e_{j}) : = f_{mask} (M_{Learn (Y_{train})} ({prompt}_{estimate} \oplus Rela (e_{j}) \oplus Dmax (e_{j}))) .

(37)

Equation (37) provides the implementation to generate

{\hat{y}}_{j}

for

e_{j}

, i.e., the solution for the formulated task in Equation (14), Section 2.3. The workflow is summarized in Algorithm 1 and Figure 1.

Algorithm 1 Learning procedure of Flood-LLM

Input: training set $Y_{train}$, pretrained LLM $M_{\bar{W}}$
Parameters: number of training epochs $T$, prompt ${prompt}_{estimate}$
Output: fine-tuned LoRA parameters $\hat{A}$ and $\hat{B}$

1: Randomly initialize $A$.
2: Initialize $B \leftarrow 0$.
3: for each property $e_{i} \in E$ do
4:   Generate $D_{i}$ according to Equation (16).
5:   Generate ${\hat{N}}_{i}$ and $N_{i} ∖ {\hat{N}}_{i}$ according to Equation (21).
6: end for
7: for epoch $= 1$ to $T$ do
8:   for each sample $(e_{i}, y_{i}) \in Y_{train}$ do
9:     Generate $Rela (e_{i})$ according to Equation (17).
10:     if $f_{neighbor} (e_{i}) \neq \emptyset$ then
11:       Generate the aggregated context $Dmax (e_{i})$ according to Equations (23) and (24).
12:     else
13:       continue
14:     end if
15:     Compute score $P_{i} (\bar{W} + A B)$ as in Equation (32).
16:   end for
17:   Optimize LoRA parameters $A, B$ by maximizing $\sum P_{i} (\bar{W} + A B)$ as in Equation (35).
18: end for
19: Save $\hat{A} \leftarrow A$ and $\hat{B} \leftarrow B$.
20: return $\hat{A}$ and $\hat{B}$

Computational Complexity Analysis

In the training process, Step 5 of Algorithm 1 involves the repeated optimization of Equation (21). A concern is whether this optimization is excessively time-consuming. We analyze the computational complexity of solving the optimization problem in Equation (21), which aims to select the optimal neighbor subset for each property

e_{i}

. First, we clarify the values of key parameters involved in the complexity calculation: the dimension of the property representation vector H,

d_{emb}

, is set to 1024 in accordance with the suggestions of LLM providers (e.g., Qwen-3 and LLaMA-3.2); the average value of

| N_{i} | \approx 8

in Brisbane; the size of the output optimal neighbor subset, k, is set to 8 via parameter analysis (detailed in Section 3.5); the total number of properties in Brisbane, as described above, is

| E |

= 519,009. The optimization problem in Equation (21) is a subset selection task that maximizes the variance of neighbor representation vectors under the constraint

| {\tilde{N}}_{i} | \leq k

. We decompose its complexity for a single property

e_{i}

and then extend this analysis to the entire dataset.

For a single property

e_{i}

, the core computation consists of two parts: calculating the variance

δ ({\tilde{N}}_{i})

for a candidate subset

{\tilde{N}}_{i}

, and traversing valid candidate subsets to identify the optimal one. For the variance calculation of

δ ({\tilde{N}}_{i})

, for a subset

{\tilde{N}}_{i}

with size t (

t \leq k

), we first compute the mean representation of the t vectors (with a computational cost of

O (t \cdot d_{emb})

) and then calculate the squared Euclidean norm between each vector and this mean (with an additional cost of

O (t \cdot d_{emb})

). The total cost for one subset is

O (t \cdot d_{emb})

, and since

t \leq k

, this cost is bounded by

O (k \cdot d_{emb})

. To traverse the candidate subsets and find

arg {max}_{{\tilde{N}}_{i}} δ {({\tilde{N}}_{i})}^{2}

, we need to traverse all valid subsets of

N_{i}

with size

\leq k

. Given

| N_{i} | \approx 8

and

k = 8

, the number of candidate subsets to traverse is approximately $2^{8} = 256$, which is a relatively small constant (denoted as

C o n

) that is independent of the total number of properties

| E |

.

The overall complexity for the entire dataset can be derived by combining the two parts above. The computational cost for a single property

e_{i}

is

O (C o n + k \times d_{emb})

. Since

C o n

, k, and

d_{emb}

are all fixed parameters, we execute the above single-property computation independently for each

e_{i}

in the entire dataset with

| E |

properties. Given that the computational cost for each property

(C o n + k \times d_{emb}) ≪ | E |

(where 519,009 is approximately 61 times the value of

256 + 8 \times 1024

), the overall complexity of solving Equation (21) is:

O [| E | \times (C o n + k \times d_{emb})] = O (| E |)

(38)

Since the overall complexity is linear with respect to

| E |

(i.e.,

O (| E |)

), the computational overhead of solving Equation (21) is negligible in the entire pipeline and fully acceptable for large-scale datasets such as the Brisbane urban planning dataset.

Specifically, if we utilize this algorithm in cities with complex neighborhood relations, we can reduce this complexity by directly applying

{\hat{N}}_{i} = N_{i}

for entities with

| N_{i} | \leq k

, as these entities already possess small neighbor sets.
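The size of the search space discussed above can be checked with a short count of candidate subsets. This is an illustrative helper (the name `candidate_subsets` is an assumption) that counts the non-empty subsets of a neighbor set with at most $k$ elements, which is the set traversed when solving Equation (21) exhaustively.

```python
from math import comb


def candidate_subsets(num_neighbors, k):
    """Count non-empty subsets of size <= k among `num_neighbors` neighbors."""
    upper = min(k, num_neighbors)
    return sum(comb(num_neighbors, t) for t in range(1, upper + 1))


# Average Brisbane setting reported above: |N_i| = 8 and k = 8.
print(candidate_subsets(8, 8))
```

For $| N_{i} | = 8$ and $k = 8$ this gives 255 non-empty subsets (256 when the empty set is included), in line with the constant of approximately 256 used in the analysis. The count grows combinatorially with $| N_{i} |$, which is why bounding $k$ matters in denser neighborhoods.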

3. Results

3.1. Experimental Settings

Following the standard evaluation framework for classification [9,36], we employ the following metrics to assess predictive performance: the confusion matrix and accuracy.

Confusion matrix for multi-class classification (Table 2). A confusion matrix for multi-class classification is used to present the classification results by comparing the predicted labels against the actual labels. For a classification problem with $c_n$ classes, the matrix is of size $c_n \times c_n$ , where each row corresponds to the predicted class and each column corresponds to the actual class. The diagonal elements represent correct predictions, while off-diagonal elements indicate misclassifications.

In this matrix, each entry

{Conf}_{i j}

denotes the number of samples that belong to actual class j but were predicted as class i. For a specific class c:

–: True Positives for class c ( ${TP}_{c}$ ): the number of samples correctly predicted as class c, i.e., ${Conf}_{c c}$ .
–: False Positives for class c ( ${FP}_{c}$ ): the number of samples from other classes that were incorrectly predicted as class c, i.e., $\sum_{j \neq c} {Conf}_{c j}$ .
–: False Negatives for class c ( ${FN}_{c}$ ): the number of samples from class c that were incorrectly predicted as other classes, i.e., $\sum_{i \neq c} {Conf}_{i c}$ .
–: True Negatives for class c ( ${TN}_{c}$ ): the number of samples correctly predicted as not class c, i.e., all entries excluding row c and column c.

Accuracy. This metric judges the global alignment of the results [36] and is calculated as follows:

$Accuracy = \sum_{c}^{c_n} \frac{T P_{c} + T N_{c}}{T P_{c} + F P_{c} + T N_{c} + F N_{c}} .$

(39)

Level Accuracy (L.Acc). Specifically, to evaluate the model’s performance across different flood risk levels (Level 0–3), we compute the accuracy for each class c:

${L . Acc}_{c} = \frac{T P_{c} + T N_{c}}{T P_{c} + F P_{c} + T N_{c} + F N_{c}} .$

(40)
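The per-class counts and the level accuracy in Equation (40) can be computed directly from a confusion matrix, as sketched below. The matrix follows the convention stated above (rows are predicted classes, columns are actual classes); the function names and the 3 × 3 sample matrix are hypothetical and do not correspond to results reported in this paper.

```python
def per_class_counts(conf, c):
    """Return (TP, FP, FN, TN) for class c; rows predicted, columns actual."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = conf[c][c]
    fp = sum(conf[c][j] for j in range(n) if j != c)  # rest of row c
    fn = sum(conf[i][c] for i in range(n) if i != c)  # rest of column c
    tn = total - tp - fp - fn                         # outside row/column c
    return tp, fp, fn, tn


def level_accuracy(conf, c):
    """Equation (40): (TP + TN) / (TP + FP + TN + FN) for class c."""
    tp, fp, fn, tn = per_class_counts(conf, c)
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical 3-class confusion matrix with 20 samples.
conf = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 0, 8],
]
for c in range(len(conf)):
    print(c, per_class_counts(conf, c), level_accuracy(conf, c))
```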

For the compared approaches, drawing on methods that have been widely adopted and empirically validated in the flood prediction literature, we select a set of representative machine learning and deep learning models for comparative evaluation. Support Vector Machines (SVM) and Random Forest (RF) are traditional machine learning methods that have been shown to effectively capture flood data patterns [9,26]. SVM works by finding the optimal hyperplane to separate data classes, while RF improves prediction accuracy through ensemble decision trees. Multilayer Perceptron (MLP) and Graph Convolutional Network (GCN) are among the top-performing deep learning methods for flood modeling [25]. MLP is a classic deep learning model with fully connected layers, whereas GCN leverages graph convolutions to incorporate neighboring information, enhancing its effectiveness.

Additionally, we employ open-source large language models (LLMs) without Flood-LLM fine-tuning as baseline comparisons, using LLaMA3.2-1B-Instruct [37] and Qwen3-1.7B-Instruct [38]. These models were chosen for their strong performance and widespread adoption while maintaining manageable computational cost. Given the shared transformer-based architecture of modern LLMs, experiments on these models are sufficient to demonstrate the compatibility of our framework with this class of models. Both models take inputs from the Relevant Info Agent and the Div-max Neighbor Info Agent to assess their performance in flood risk prediction.

We conduct the experiments on a high-performance system equipped with two NVIDIA A800 GPUs for efficient training and inference. The system is powered by a 32-core CPU and is equipped with 80 GB of RAM, ensuring smooth processing and fast computation for large-scale models and datasets.

The threshold

η

in Equation (15) is set to 100 m to limit the search scope, consistent with commonly adopted settings in prior studies [39,40,41]. The rank parameter

γ

in Equation (35) is set to 64. Matrix A is initialized using a Gaussian distribution, while matrix B is initialized with all-zero values, consistent with Algorithm 1. This initialization strategy follows the recommendations of the original LoRA paper [35] and is adopted to ensure stable and efficient optimization. For all methods, we employ the Adam optimizer [42], which is widely used and well established in the deep learning literature. For the base LLMs, we adopted the same prompt as that used for our Flood-LLM, as these models consistently failed to generate valid final outputs without this prompt. Given that this study employs relatively small-scale LLMs, which are inherently limited by poor generalization capability, we prioritized fine-tuning and did not conduct additional prompt engineering experiments on the base models. All remaining hyperparameters and experimental settings strictly follow those reported in the respective original papers.

3.2. Overall Performance Comparison Across Models

Table 3 presents the overall accuracy of various models across four flood types.

Classical ML methods (SVM and Random Forest) and the MLP operate on vector-based property-level inputs, including property coordinates, elevation, and boolean-encoded flood history. Their performance is limited as these models rely solely on tabular representations and cannot explicitly capture spatial or relational dependencies, which are critical for flood risk estimation.

The GCN extends this representation by taking a graph-structured input, where each node corresponds to a property encoded by the same vector features (coordinates, elevation, and historical indicators), and edges model neighborhood relationships. This enables the incorporation of spatial context and leads to improved performance over the MLP. However, GCNs remain constrained by over-smoothing during multi-hop aggregation, which hampers their ability to represent complex and heterogeneous urban spatial patterns.

All LLM-based approaches use the same text-based property description as their explicit input. For each property, this description encodes property coordinates, elevation, and boolean flood history in natural language form. In addition, the Relevant Info Agent integrates heterogeneous geospatial, hydrological, and infrastructural data into coherent natural language inputs. The Div-max Neighbor Info Agent enhances spatial reasoning by selecting semantically diverse and representative neighbors, preserving essential contextual information. This context supports the model’s chain-of-thought reasoning and final prediction.

Interestingly, despite this enriched contextual access, general-purpose LLMs (LLaMA3.2 and Qwen3) remain untrained on flood-risk data and consequently exhibit very poor performance. Detailed statistics on the valid output ratio and the conditional accuracy given valid outputs for base LLM approaches are reported in Table 4. These results reveal two critical limitations of directly applying base LLMs to the flood risk estimation task:

(1) The results indicate that, without fine-tuning, base 1B-scale LLMs struggle to produce valid outputs. The relatively low valid output ratio suggests that although base LLMs are capable of ingesting and processing complex textual inputs, they often fail to formulate outputs that conform to the required format or constraints. Even when the model internally arrives at a correct estimation, invalid outputs prevent automatic parsing and downstream utilization, thereby necessitating manual intervention. Such reliance on human post-processing is inefficient and incompatible with pre-expert flood risk estimation scenarios, where low human effort is a key requirement.

(2) The conditional accuracy given valid outputs reflects the correctness of predictions restricted to outputs that are syntactically and structurally valid. The results show that, even when base LLMs successfully generate valid outputs, their estimation accuracy remains limited (25.17–32.98% in Table 4). This deficiency primarily arises from the lack of domain-specific adaptation to flood risk patterns. Flood risk estimation is a highly specialized task that requires expert judgment to interpret spatial, environmental, and infrastructural information in relation to flood hazards. Such judgment is typically acquired through professional practice and accumulated empirical experience, rather than through the general linguistic and commonsense knowledge captured by large-scale pretraining corpora. Consequently, without domain-specific supervision, general-purpose LLMs lack the inductive bias necessary to align their reasoning with expert-informed flood risk assessment practices, leading to unreliable predictions.
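The two quantities in Table 4 can be related to the overall accuracy in Table 3 by a simple product, assuming that every invalid output is counted as an incorrect prediction. This assumption is not stated explicitly in the text, but the minimal sketch below shows that it reproduces the base LLaMA3.2 column of Table 3. The function name `overall_accuracy` is illustrative.

```python
def overall_accuracy(valid_ratio_pct: float, conditional_acc_pct: float) -> float:
    """Overall accuracy (%) when every invalid output counts as an error."""
    return valid_ratio_pct * conditional_acc_pct / 100.0

# (Val.Out., Con.Acc.) for base LLaMA3.2, copied from Table 4.
table4_llama = {
    "C. Flood": (16.01, 30.23),
    "R. Flood": (16.11, 31.64),
    "S. Flood": (10.46, 32.98),
    "O. Flood": (8.14, 25.17),
}

overall_llama = {
    flood: round(overall_accuracy(v, c), 2)
    for flood, (v, c) in table4_llama.items()
}
print(overall_llama)
```

Rounded to two decimals, the products are 4.84, 5.10, 3.45, and 2.05, which match the base LLaMA3.2 entries in Table 3. The low overall accuracy is therefore driven by both factors at once: few outputs are parseable, and fewer than one third of the parseable ones are correct.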

In contrast, the proposed Flood-LLM framework achieves a notable performance improvement. The Learnable Estimation Agent fine-tunes the core LLM on limited expert-labeled data via the LoRA algorithm. This training process encodes expert knowledge into the model, allowing it to internalize which patterns and attributes are considered relevant for flood risk estimation. As a result, Flood-LLM approximates expert decision-making in a scalable and interpretable manner and achieves the highest accuracy across all four flood types (Table 3).

3.3. Disaggregated Performance by Flood Presence and Severity

We visualize the flood risk estimation outputs of Flood-LLM using both LLaMA3.2-1B-Instruct and Qwen3-1.7B-Instruct, and compare them with the official flood risk map published by the Brisbane City Council (see Figure 10). More detailed quantitative results, including the Affected Property Area (A.P.A.) and the per-level accuracy (L.Acc), are provided in Table 5. The A.P.A. is obtained by aggregating the lot area attribute from the property parcel dataset over the properties exposed to each flood type and risk level. To further characterize model performance, we additionally examine the directional error structure using confusion matrices for each flood type (see Figure 11).
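The area aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the field names `lot_area_m2` and `risk`, and the toy parcels are hypothetical and are not taken from the property parcel dataset.

```python
from collections import defaultdict

def affected_property_area(parcels, flood_type):
    """Sum lot area per risk level for one flood type."""
    area_by_level = defaultdict(float)
    for parcel in parcels:
        level = parcel["risk"][flood_type]
        area_by_level[level] += parcel["lot_area_m2"]
    return dict(area_by_level)

# Toy parcels with made-up areas and levels (not data from the study).
parcels = [
    {"lot_area_m2": 600.0, "risk": {"river": 0, "creek": 1}},
    {"lot_area_m2": 450.0, "risk": {"river": 3, "creek": 0}},
    {"lot_area_m2": 800.0, "risk": {"river": 3, "creek": 1}},
]
river_apa = affected_property_area(parcels, "river")
creek_apa = affected_property_area(parcels, "creek")
print(river_apa)  # {0: 600.0, 3: 1250.0}
```

Running the same aggregation on predicted and ground-truth levels gives two area distributions per flood type that can be compared level by level.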

Taken together, the visual and quantitative results indicate that Flood-LLM achieves high reliability in identifying flood presence, while exhibiting a consistent tendency toward conservative, intermediate classifications when differentiating flood risk levels.

In terms of binary flood presence, the predicted maps align closely with the official Council assessments, achieving high accuracy in distinguishing flooded from non-flooded areas across all flood types. This indicates that Flood-LLM effectively captures the primary spatial extent of flood exposure, even in complex urban settings. The robustness of this binary performance further suggests that the integration of heterogeneous contextual information and spatial neighborhood reasoning enables reliable identification of flood-affected areas at the city scale.

With respect to the distribution of flood risk levels, a systematic bias is observed across all flood types: lower risk levels (Levels 0–1) tend to be overestimated, whereas the highest risk level (Level 3) tends to be underestimated relative to the ground truth. This pattern indicates that both Flood-LLaMA and Flood-Qwen adopt a conservative severity allocation, inflating lower-risk cases while attenuating high-severity inundation and redistributing parcels toward intermediate risk tiers. As a result, the predicted flood extents exhibit a compressed risk spectrum that favors intermediate severity and reduces the spatial footprint of extreme flooding. By comparison, although the LLaMA-based model produces a level-wise property area distribution that appears slightly closer to the Council maps, it exhibits marginally lower accuracy than Qwen, particularly in distinguishing adjacent flood risk levels. This difference may be attributable to the limitations of current large language models in mathematical reasoning and quantitative calibration, despite their strong capabilities in qualitative pattern recognition. Such behavior is consistent with the well-documented weaknesses of LLMs in mathematical and coding tasks.
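The distinction between presence accuracy and level accuracy can be made concrete with a small sketch that follows the layout of Table 2 (rows are predicted classes, columns are actual classes). The per-level accuracy is computed here as the share of properties at a given actual level that are predicted correctly, which is an assumed reading of L.Acc; the function names and toy labels are illustrative only.

```python
def confusion_matrix(predicted, actual, n_classes=4):
    """conf[p][a] counts properties predicted as p whose actual level is a."""
    conf = [[0] * n_classes for _ in range(n_classes)]
    for p, a in zip(predicted, actual):
        conf[p][a] += 1
    return conf

def presence_accuracy(predicted, actual):
    """Share of properties whose flooded status (level > 0) is correct."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)

def level_accuracy(conf, level):
    """Share of properties at the given actual level predicted correctly."""
    actual_total = sum(row[level] for row in conf)
    return conf[level][level] / actual_total if actual_total else float("nan")

# Toy labels (illustrative only, not results from the study).
actual    = [0, 0, 1, 2, 3, 3, 3, 0]
predicted = [0, 0, 1, 2, 2, 3, 2, 0]

conf = confusion_matrix(predicted, actual)
print(presence_accuracy(predicted, actual))  # 1.0
print(conf[2][3])                            # 2 Level-3 properties predicted as Level 2
print(level_accuracy(conf, 3))               # one of three Level-3 properties is correct
```

In this toy case every property is correctly identified as flooded or not flooded, yet two of the three Level 3 properties are assigned Level 2. The off-diagonal mass in the Level 3 column is the kind of directional error that the confusion matrices are used to expose.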

3.4. Ablation Study

Our ablation study investigates whether the multi-domain contextual information integrated by the Relevant Info Agent significantly impacts flood risk estimation performance. As shown in Table 6, we observe three key findings:

First, the models successfully capture major flood threat patterns, with particularly strong performance on creek (C. Flood: 88.80–90.40%), river (R. Flood: 88.55–90.75%), and storm-tide floods (S. Flood: 91.75–95.45%). The performance degradation when removing specific contexts confirms their importance: hydrological removal reduces accuracy by 1.9–5.1% for waterway-related floods, while infrastructural removal causes the most significant drop for overland-flow floods (O. Flood: 6.7–7.5% decrease).

Second, the models demonstrate robust fallback capabilities, maintaining reasonable accuracy (all >73%) even when critical contexts are excluded. This suggests effective information redundancy, where remaining contexts can partially compensate for missing domains (e.g., historical data helping mitigate geospatial removal impacts).

Third, the complete Flood-LLM configuration consistently achieves optimal results across all scenarios, validating our multi-context integration approach. The performance advantage is most pronounced for complex flood types like overland-flow, where full-context Qwen3 outperforms its ablated versions by 6.5–14.0%.

3.5. Parameter Analysis

To examine the effect of the Div-max Neighbor Info Agent, we conduct a parameter analysis on the neighborhood size k in Equation (21) using the overland-flow flood dataset.

As described in Equation (23), smaller values of k provide less detailed neighborhood context, as most neighboring properties are summarized in the form “{this neighbor} is similar to {an already included neighbor}.” In particular, k = 0 corresponds to the absence of neighbor information, while k = 20 is excessively large for most properties, as it incorporates nearly all neighbors with minimal summarization.
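The role of k can be illustrated with a minimal sketch of diversity-driven neighbor selection. The sketch uses a generic greedy farthest-point rule as a stand-in; it is an assumption for illustration and is not the selection criterion of Equation (21). The function name `select_diverse`, the feature tuples, and the toy values are hypothetical; only the summary sentence pattern follows the form quoted above.

```python
import math

def select_diverse(neighbors, k):
    """Greedily pick up to k neighbors that are far apart in feature space.

    neighbors: dict mapping a neighbor id to a numeric feature tuple.
    Returns (selected ids, {unselected id: most similar selected id}).
    """
    ids = list(neighbors)
    if k <= 0 or not ids:
        return [], {}  # k = 0: no neighbor information at all
    selected = [ids[0]]  # seed with the first neighbor (arbitrary choice)
    while len(selected) < min(k, len(ids)):
        remaining = [i for i in ids if i not in selected]
        # Pick the neighbor whose nearest selected neighbor is farthest away.
        best = max(
            remaining,
            key=lambda i: min(math.dist(neighbors[i], neighbors[s])
                              for s in selected),
        )
        selected.append(best)
    # Every unselected neighbor is summarized by its closest selected one.
    similar_to = {
        i: min(selected, key=lambda s: math.dist(neighbors[i], neighbors[s]))
        for i in ids if i not in selected
    }
    return selected, similar_to

# Toy features: (elevation in m, flooded in 2011 as 0/1); values are made up.
neighbors = {"A": (7.1, 1), "B": (7.2, 1), "C": (15.0, 0), "D": (14.8, 0)}
selected, similar_to = select_diverse(neighbors, k=2)
for nid, ref in similar_to.items():
    print(f"{nid} is similar to {ref}")
```

With k = 2, the sketch keeps one low-lying and one elevated neighbor in full and reduces the other two to one-line references. With k larger than the number of neighbors, nothing is summarized, which mirrors the k = 20 case described above.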

The results are presented in Figure 12, from which three key observations can be drawn. First, incorporating neighborhood information consistently improves the performance of Flood-LLM. When neighbor information is removed, the accuracy of Flood-LLaMA and Flood-Qwen decreases by 2.91% and 2.53%, respectively, highlighting the importance of neighborhood context in flood risk estimation.

Second, when k is too small (e.g., k = 4), performance remains limited. This is likely because excessive summarization leads to substantial information loss, which adversely affects flood risk estimation.

Third, increasing k beyond an optimal range does not yield further performance gains. This suggests that essential neighborhood information is already captured, and adding additional neighbors may instead introduce noise or distract the LLM.

Based on these findings, we set k = 8 in our implementation, as it achieves the best overall performance in this experiment and is therefore adopted for comparison with other settings.

3.6. Analysis of LLM Reasoning Processes

We also conduct several case studies to analyze the prediction workflow of the LLM. In Table 7 and Box 1, we present the facts and chain-of-thought for a specific property generated by the Qwen model after supervised fine-tuning (SFT). In this example, although the model has not been exposed to any human reasoning process, it systematically and correctly identifies the key factors needed to estimate the risk associated with each type of flood and derives the final result accordingly. Such chain-of-thought reasoning helps domain experts understand the LLM’s decision-making workflow, making the model’s behavior safer and more transparent.

Box 1. Chain-of-thought interpretation by Flood-Qwen3 for property 13 * 46

Here’s the reasoning behind the flood risk assessment for each type:
Creek flood (0): No open waterways or gullies within 100 m of the target property. Neighboring properties with gullies have proper drainage pipes, reducing overflow risk.
River flood (3): The property was flooded in 1974, 2011, and 2022, indicating a recurring high-risk pattern. Multiple neighboring properties at the same elevation (7.13 m AHD) also experienced these floods. The elevation (7.13 m) is likely within a major river’s floodplain.
Storm tide flood (0): The property is ∼20.4 km from the coastline, far enough to avoid storm surges. No historical records suggest storm tide flooding.
Overflow flood (2): No major gullies or drainage issues directly near the property. However, past floods (1974, 2011, 2022) suggest drainage systems may be overwhelmed during extreme rainfall. Some neighbors have gullies with drainage pipes, but the target property itself lacks direct drainage infrastructure.
Final Output: [0, 2, 0, 1]

4. Discussion

4.1. Multi-Domain Integration and the Complexity of Urban Flood Risk

The ablation experiments demonstrate that combining geographical, hydrological, drainage-infrastructure, and historical flood information substantially enhances predictive performance. In each variant where one data domain is removed, the model exhibits a notable reduction in accuracy, indicating that urban flood risk emerges from the interaction of multiple physical and infrastructural factors rather than from single-variable determinants. This finding reflects the real spatial complexity of urban environments, in which elevation, runoff pathways, pipe capacity, soil permeability, and past flood behaviors influence one another. By converting heterogeneous inputs into a unified textual representation, Flood-LLM can associate diverse information sources in a manner that resembles aspects of expert flood assessment, suggesting a possible way to address data-integration challenges in flood risk analysis.

4.2. Neighborhood Context and the Identification of Localized Vulnerabilities

The parameter analysis further shows that the incorporation of spatially diverse neighboring properties improves the model’s ability to detect localized flood risks. Urban flooding often depends on micro-topographic variations, subtle drainage connections, and small-scale runoff patterns that are not adequately captured when using only the target property’s attributes. The improvements introduced by the Div-max Neighbor Info Agent indicate that effective flood prediction requires not only spatial proximity but also contextual diversity. This capability reflects how hydrological practitioners interpret flood behavior: properties that appear similar in elevation or land use may experience different hazards depending on upstream flow paths, pipe networks, or historical recurrence patterns. The findings therefore highlight the importance of contextual reasoning in enhancing the accuracy of risk assessments in heterogeneous urban settings.

4.3. Prospects and Reflections on the Framework’s Methodological Potential

This study shows that Flood-LLM achieves effective learning from heterogeneous urban inputs and produces observable reasoning patterns during prediction that align with expert-derived flood risk labels. These findings point to two prospective methodological strengths of the framework. First, the training paradigm, which associates diverse urban indicators with planning labels, suggests the possibility of adapting the approach to different planning contexts through retraining on locally defined labeling schemes. Second, the use of transparent reasoning chains may support interpretation and oversight by making visible how the model connects terrain conditions, infrastructure indicators, waterway proximity, and historical flood information when forming predictions. In any potential future application context where model outputs might diverge from expert judgment, these reasoning traces could provide a basis for examining which spatial cues were emphasized, potentially assisting experts in identifying whether discrepancies arise from data limitations, contextual ambiguity, or model bias rather than opaque computational processes.

However, it is important to emphasize that these considerations remain exploratory and relate to possible future developments rather than present-day applications. The findings demonstrate the framework’s methodological potential, while any practical relevance would depend on further technical refinement together with careful development of appropriate governance, accountability, and ethical frameworks.

4.4. Opportunities and Limitations of LLM-Based Flood Risk Models

Despite the model’s overall performance advantage over the SVM, RF, MLP, and GCN baselines, the confusion matrices (Figure 11) reveal misclassifications in intermediate risk categories. Flood-LLM is more reliable in distinguishing flood presence versus absence than in differentiating fine-grained risk levels, particularly for properties whose outcomes depend on sensitive hydrodynamic behaviors or infrastructure thresholds not fully captured in text. This limitation reflects a broader challenge in applying LLMs to flood prediction: models excel at synthesizing multi-domain qualitative information but may lack the numerical precision required for borderline distinctions. A promising direction for future work is therefore to decouple the prediction task by first focusing on flooded-area identification, followed by targeted optimization for risk level classification within flooded regions.

The confusion matrices also suggest a tendency toward relatively conservative classifications, indicating a general shift away from extreme categories toward intermediate ones. This pattern reflects a degree of imprecision in how the model currently associates spatial cues with planning labels. If applied in practical contexts, such imprecision could influence outcomes, and therefore future work would need to focus on calibration strategies and threshold design to further refine the model’s behavior.

Several methodological limitations identified in this study point to opportunities for further refinement:

1. The flood risk labels adopted for model training suffer from inherent limitations arising from heterogeneous sources and asynchronous update cycles. Overfitting to such labels may lead the model to learn spurious spatial dependencies, implying that future research could benefit from labeling sources with higher internal consistency and synchronous updating.
2. The merging of low and very-low risk categories reduces label granularity, suggesting that future work may benefit from retaining finer label distinctions to improve sensitivity to subtle spatial variations in flood exposure.
3. Although training and testing labels were spatially separated, adjacent areas still share a continuous spatial context observable to the model, suggesting that future studies may explore more spatially robust evaluation strategies to better isolate potential spatial dependence effects.
4. The use of a single fixed buffer distance to characterize drainage proximity may overlook variations in how drainage configurations influence parcel-level flood exposure, suggesting that future work may benefit from exploring multiple or adaptive spatial extents to more accurately capture how drainage configurations relate to parcel-level risk patterns.
5. To limit computational cost, we used relatively small-scale LLMs with restricted generalization capacity, focusing on fine-tuning rather than prompt engineering. Future work with larger or more capable models may explore prompt-based improvements beyond the scope of this study.
6. Despite promising performance, LLMs remain inferior to human experts in reasoning, error avoidance, complex knowledge use, and accountability. This study presents a research direction rather than a mature solution. Flood-LLM may still generate erroneous inferences, such as spurious spatial dependencies, and requires further refinement before it can approach practical use.

In addition, several directions for future development emerge from this work. Integrating additional multi-modal data sources, such as remote sensing imagery, rainfall radar observations, and outputs from physics-based hydrodynamic simulations, may enhance the model’s sensitivity to fine-scale physical processes. Extending the framework to incorporate temporal dynamics could enable analyses under evolving climatic or infrastructural conditions. Further evaluation is also needed to assess the robustness of the approach in cities with sparse drainage data or highly irregular terrain. Incorporating basic hydrologically relevant datasets, such as drainage network hydraulics, may also help the model better capture flow-related spatial patterns and reduce systematic biases in level estimation.

5. Conclusions

This study presents Flood-LLM, a multi-agent large language model framework for exploring how heterogeneous urban data can be translated into parcel-level flood risk estimations through structured narrative reasoning. The Relevant Info Agent organizes parcel-level geospatial, elevation, drainage, waterway, and historical flood information into structured descriptions. The Div-max Neighbor Info Agent extends this representation by identifying relevant neighboring parcels and constructing a broader neighborhood-scale spatial narrative. The Learnable Estimation Agent, supported by LoRA-based fine-tuning, then relates the combined parcel- and neighborhood-level narratives to ordered risk categories in an interpretable manner. When applied to Brisbane, the results show that the framework can approximate expert-derived spatial risk patterns while making visible how different spatial cues contribute to predictions.

Future research may also extend the framework by incorporating additional multi-modal and temporally dynamic data, such as remote sensing imagery, rainfall radar observations, hydrodynamic simulation outputs, and drainage hydraulics, as well as by examining its behavior in cities with different data availability and terrain conditions.

In conclusion, this research suggests that large language models may offer a way to interpret complex urban spatial information for flood risk estimation, particularly in contexts where detailed hydraulic modeling is not readily available. Flood-LLM illustrates how such AI-based approaches can be structured around transparent reasoning and heterogeneous spatial data. The framework still requires further refinement and optimization, and any potential future application would need to give careful consideration to transparency, accountability, and ethical governance.

Author Contributions

Conceptualization, Y.W. and J.J.; methodology, Y.W. and J.J.; software, Y.W. and J.J.; validation, Y.W.; formal analysis, Y.W.; investigation, Y.W. and J.J.; resources, J.J. and M.M.; data curation, Y.W., J.J. and M.M.; writing—original draft preparation, Y.W. and J.J.; writing—review and editing, Y.W. and J.J.; visualization, Y.W. and J.J.; supervision, M.M.; project administration, M.M.; funding acquisition, Y.W., J.J. and M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Scholarship Council, No. 202006130005, No. 202006020044.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The geospatial and infrastructural datasets used in this study are derived from publicly accessible sources provided by the Brisbane City Council Open Data Portal at https://data.brisbane.qld.gov.au/pages/home/ (accessed on 16 July 2025) and the Queensland Spatial Catalogue at https://qldspatial.information.qld.gov.au/catalogue/custom/index.page (accessed on 16 July 2025). The processed datasets generated during the current study are available from the corresponding author upon reasonable request. The source code for implementing the Flood-LLM framework is openly available at: https://github.com/Super-E-Fee/Flood-LLM (accessed on 16 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Challies, E.; Newig, J.; Thaler, T.; Kochskämper, E.; Levin-Keitel, M. Participatory and collaborative governance for sustainable flood risk management: An emerging research agenda. Environ. Sci. Policy 2016, 55, 275–280.
2. Wang, L.; Cui, S.; Li, Y.; Huang, H.; Manandhar, B.; Nitivattananon, V.; Fang, X.; Huang, W. A review of the flood management: From flood control to flood resilience. Heliyon 2022, 8, e11763.
3. Trinh, M.X.; Molkenthin, F. Flood hazard mapping for data-scarce and ungauged coastal river basins using advanced hydrodynamic models, high temporal-spatial resolution remote sensing precipitation data, and satellite imageries. Nat. Hazards 2021, 109, 441–469.
4. Gnecco, I.; Pirlone, F.; Spadaro, I.; Bruno, F.; Lobascio, M.C.; Sposito, S.; Pezzagno, M.; Palla, A. Participatory mapping for enhancing flood risk resilient and sustainable urban drainage: A collaborative approach for the Genoa case study. Sustainability 2024, 16, 1936.
5. Mees, H.; Crabbé, A.; Alexander, M.; Kaufmann, M.; Bruzzone, S.; Lévy, L.; Lewandowski, J. Coproducing flood risk management through citizen involvement: Insights from cross-country comparison in Europe. Ecol. Soc. 2016, 21, 7.
6. Wheater, H. Progress in and prospects for fluvial flood modelling. Philos. Trans. R. Soc. London Ser. A Math. Phys. Eng. Sci. 2002, 360, 1409–1431.
7. Giustarini, L.; Chini, M.; Hostache, R.; Pappenberger, F.; Matgen, P. Flood hazard mapping combining hydrodynamic modeling and multi annual remote sensing data. Remote Sens. 2015, 7, 14200–14226.
8. Boulaire, F.A.; Cook, S.; Fleming, A.; Romanach, L.; Capon, T.; Po, M.; Darbyshire, R.; Barnett, G.; Bluhm, S.; Lin, B.B. Insights on the process to develop Australia’s first national climate risk assessment. iScience 2025, 28, 112068.
9. Antzoulatos, G.; Kouloglou, I.O.; Bakratsas, M.; Moumtzidou, A.; Gialampoukidis, I.; Karakostas, A.; Lombardo, F.; Fiorin, R.; Norbiato, D.; Ferri, M.; et al. Flood hazard and risk mapping by applying an explainable machine learning framework using satellite imagery and GIS data. Sustainability 2022, 14, 3251.
10. Muñoz, D.F.; Muñoz, P.; Moftakhari, H.; Moradkhani, H. From local to regional compound flood mapping with deep learning and data fusion techniques. Sci. Total Environ. 2021, 782, 146927.
11. Hofmann, J.; Schüttrumpf, H. Floodgan: Using deep adversarial learning to predict pluvial flooding in real time. Water 2021, 13, 2255.
12. Yang, F.; Ding, W.; Zhao, J.; Song, L.; Yang, D.; Li, X. Rapid urban flood inundation forecasting using a physics-informed deep learning approach. J. Hydrol. 2024, 643, 131998.
13. Lee, C.C.; Huang, L.; Antolini, F.; Garcia, M.; Juan, A.; Brody, S.D.; Mostafavi, A. Predicting peak inundation depths with a physics informed machine learning model. Sci. Rep. 2024, 14, 14826.
14. Yin, K.; Mostafavi, A. Unsupervised graph deep learning reveals emergent flood risk profile of urban areas. arXiv 2023, arXiv:2309.14610.
15. Liu, C.; Mostafavi, A. Floodgenome: Interpretable machine learning for decoding features shaping property flood risk predisposition in cities. Environ. Res. Infrastruct. Sustain. 2025, 5, 015018.
16. Yokoya, N.; Yamanoi, K.; He, W.; Baier, G.; Adriano, B.; Miura, H.; Oishi, S. Breaking the limits of remote sensing by simulation and deep learning for flood and debris flow mapping. arXiv 2020, arXiv:2006.05180.
17. Moshe, Z.; Metzger, A.; Elidan, G.; Kratzert, F.; Nevo, S.; El-Yaniv, R. Hydronets: Leveraging river structure for hydrologic modeling. arXiv 2020, arXiv:2007.00595.
18. Guo, Z.; Leitao, J.P.; Simões, N.E.; Moosavi, V. Data-driven flood emulation: Speeding up urban flood predictions by deep convolutional neural networks. J. Flood Risk Manag. 2021, 14, e12684.
19. Cho, M.; Kim, C.; Jung, K.; Jung, H. Water level prediction model applying a long short-term memory (lstm)–gated recurrent unit (gru) method for flood prediction. Water 2022, 14, 2221.
20. Liu, B.; Tang, Q.; Zhao, G.; Gao, L.; Shen, C.; Pan, B. Physics-guided long short-term memory network for streamflow and flood simulations in the Lancang–Mekong river basin. Water 2022, 14, 1429.
21. Zhou, Q.; Teng, S.; Situ, Z.; Liao, X.; Feng, J.; Chen, G.; Zhang, J.; Lu, Z. A deep-learning-technique-based data-driven model for accurate and rapid flood predictions in temporal and spatial dimensions. Hydrol. Earth Syst. Sci. 2023, 27, 1791–1808.
22. Pianforini, M.; Dazzi, S.; Pilzer, A.; Vacondio, R. A deep learning model for real-time forecasting of 2-D river flood inundation maps. Hydrol. Earth Syst. Sci. Discuss. 2024, 2024, 1–44.
23. Wang, Y.; Zhang, P.; Xie, Y.; Chen, L.; Li, Y. Toward explainable flood risk prediction: Integrating a novel hybrid machine learning model. Sustain. Cities Soc. 2025, 120, 106140.
24. Ryd, E.; Nearing, G. Fine Flood Forecasts: Incorporating local data into global models through fine-tuning. arXiv 2025, arXiv:2504.12559.
25. Shu, Y.; Zheng, G.; Yan, X. Application of Multiple Geographical Units Convolutional Neural Network based on neighborhood effects in urban waterlogging risk assessment in the city of Guangzhou, China. Phys. Chem. Earth Parts A/B/C 2022, 126, 103054.
26. Wei, Q.; Zhang, H.; Chen, Y.; Xie, Y.; Yin, H.; Xu, Z. City scale urban flooding risk assessment using multi-source data and machine learning approach. J. Hydrol. 2025, 651, 132626.
27. Mehmood, H. Leveraging large language models for floods mapping and advanced spatial decision support: A user-friendly approach with SATGPT. ITU J. ICT Discov. 2025, 6, 57–66.
28. Brisbane City Council. Flooding in Brisbane: A Guide for Residents; Online report; Brisbane City Council: Brisbane, Australia, 2024.
29. Bentivoglio, R.; Isufi, E.; Jonkman, S.N.; Taormina, R. Deep learning methods for flood mapping: A review of existing applications and future research directions. Hydrol. Earth Syst. Sci. Discuss. 2022, 2022, 4345–4378.
30. Brisbane City Council. Understanding Flood Likelihood and Impact; Online report; Brisbane City Council: Brisbane, Australia, 2025.
31. Queensland Audit Office. Brisbane River Strategic Floodplain Management Plan; Online report; Queensland Audit Office: Brisbane, Australia, 2019.
32. Queensland Audit Office. Flood Resilience of River Catchments; Online report; Queensland Audit Office: Brisbane, Australia, 2025.
33. Riche, A.; Drias, A.; Guermoui, M.; Gherib, T.; Boulmaiz, T.; Souissi, B.; Melgani, F. A novel hybrid deep-learning approach for flood-susceptibility mapping. Remote Sens. 2024, 16, 3673.
34. Jordahl, K.; den Bossche, J.V.; Fleischmann, M.; Wasserman, J.; McBride, J.; Gerard, J.; Tratner, J.; Perry, M.; Badaracco, A.G.; Farmer, C.; et al. geopandas/geopandas: v0.8.1. Zenodo 2020.
35. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
36. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
37. Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Meta Blog, 2024 (retrieved 20 December 2024).
38. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.
39. Slough Borough Council. Strategic Flood Risk Assessment; Online report; Slough Borough Council: Slough, UK, 2007.
40. Innocent, E.; Ogedegbe, S. Geospatial analysis of flood problems in jimeta riverine community of adamawa state, Nigeria. J. Environ. Earth Sci. 2015, 5, 32–45.
41. Duran, E.; Demir, I. Enhancing the Resilience of Wind Energy Infrastructure in Iowa: Flood Risk Assessment and Site Suitability Analysis for Critical Infrastructure Protection. Int. J. Disaster Risk Reduct. 2026, 133, 106003.
42. Kingma, D.P. Adam: A method for stochastic optimization. In Proceedings of the ICLR, Banff, AB, Canada, 14–16 April 2014.

Figure 1. The overall framework and key components of Flood-LLM. LLM stands for large language model. LoRA A and LoRA B represent the low-rank adaptation matrices in the LoRA-based parameter-efficient fine-tuning module.

Figure 2. Satellite image and maps of property parcels in Brisbane. (a) Satellite image. (b) Property map. The red lines indicate the area of Brisbane covered in this study (based on the Brisbane Suburb Boundaries datasets released by BCC).

Figure 3. Maps of elevation and hydrological data in Brisbane. (a) Elevation map. (b) Waterway map.

Figure 4. Maps of drainage infrastructure data in Brisbane. (a) Map of pipes. (b) Map of gullies.

Figure 5. Maps of historical flood records in Brisbane. (a) Flood record: 1974. (b) Flood record: 2011. (c) Flood record: 2022.

Figure 6. Classification statistics of property counts by flood risk level based on the Brisbane City Council ground truth.

Figure 7. Structure of relevant info agent (LLM: large language model).

Figure 8. Structure of Div-max neighbor info agent (LLM: large language model).

Figure 9. Structure of learnable estimation agent (LLM: large language model).

Figure 10. Visualization of estimated risk maps alongside the council-provided expert assessment (top left). (a) Creek flood risk map showing LLM-based predictions. (b) River flood risk map showing LLM-based predictions. (c) Storm-tide risk map showing LLM-based predictions. (d) Overland-flow risk map showing LLM-based predictions.

Figure 11. Confusion matrices of Flood-LLMs on different flood types.

Figure 12. Parameter analysis on O. Flood. (a) Flood-LLaMA. (b) Flood-Qwen.

Table 1. Summary of representative ML-based Flood studies.

	Property-Detail	Relevant Information			Neighborhood	Reasoning
		Geometry	Hydrological	Infrastructural
(Yokoya et al., 2020) [16]	×	✓	×	×	×	×
(Moshe et al., 2020) [17]	×	✓	✓	×	✓	×
(Guo et al., 2021) [18]	×	✓	×	✓	✓	×
(Muñoz et al., 2021) [10]	×	✓	×	×	×	×
(Hofmann and Schüttrumpf, 2021) [11]	×	✓	×	✓	✓	×
(Cho et al., 2022) [19]	×	✓	✓	×	×	×
(Liu et al., 2022) [20]	×	✓	×	✓	✓	×
(Antzoulatos et al., 2022) [9]	×	✓	×	×	✓	×
(Yin and Mostafavi, 2023) [14]	×	✓	✓	×	✓	×
(Zhou et al., 2023) [21]	×	✓	×	×	✓	×
(Yang et al., 2024) [12]	×	✓	✓	✓	×	×
(Pianforini et al., 2024) [22]	×	✓	✓	×	×	×
(Lee et al., 2024) [13]	×	✓	×	✓	✓	×
(Liu and Mostafavi, 2025) [15]	×	✓	✓	×	✓	×
(Wang et al., 2025) [23]	×	✓	✓	×	✓	×
(Ryd and Nearing, 2025) [24]	×	✓	✓	×	×	×
(Shu et al., 2022) [25]	✓	✓	×	✓	✓	×
(Wei et al., 2025) [26]	✓	✓	×	✓	✓	×
Flood-LLM	✓	✓	✓	✓	✓	✓

Table 2. Multi-class confusion matrix representation.

	Actual Class 0	Actual Class 1	⋯	Actual Class $c_n - 1$
Predicted Class 0	${Conf}_{00}$	${Conf}_{01}$	⋯	${Conf}_{0 (c_n - 1)}$
Predicted Class 1	${Conf}_{10}$	${Conf}_{11}$	⋯	${Conf}_{1 (c_n - 1)}$
⋮	⋮	⋮	⋱	⋮
Predicted Class $c_n - 1$	${Conf}_{(c_n - 1) 0}$	${Conf}_{(c_n - 1) 1}$	⋯	${Conf}_{(c_n - 1) (c_n - 1)}$
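
As a minimal sketch of how the layout in Table 2 can be populated and read, the Python snippet below builds a matrix with predicted classes as rows and actual classes as columns, then derives the overall accuracy from its diagonal. The function names, the four-class setting, and the toy label lists are illustrative assumptions, not the authors' evaluation code. The column-wise rate is included only as one common way to disaggregate results by actual class.

```python
from typing import List, Sequence


def confusion_matrix(predicted: Sequence[int], actual: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Count matrix following Table 2: rows are predicted, columns are actual."""
    conf = [[0] * n_classes for _ in range(n_classes)]
    for p, a in zip(predicted, actual):
        conf[p][a] += 1
    return conf


def overall_accuracy(conf: List[List[int]]) -> float:
    """Share of samples on the diagonal (predicted class equals actual class)."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total if total else 0.0


def per_actual_class_rate(conf: List[List[int]]) -> List[float]:
    """For each actual class (column), the share predicted correctly."""
    rates = []
    for j in range(len(conf)):
        column_total = sum(conf[i][j] for i in range(len(conf)))
        rates.append(conf[j][j] / column_total if column_total else 0.0)
    return rates


# Toy labels for four risk levels (0 to 3); purely illustrative data.
predicted = [0, 0, 1, 2, 3, 3, 1, 0]
actual = [0, 0, 1, 2, 3, 2, 1, 1]

conf = confusion_matrix(predicted, actual, n_classes=4)
print(overall_accuracy(conf))
print(per_actual_class_rate(conf))
```

Off-diagonal entries such as `conf[3][2]` count properties whose actual class is 2 but were predicted as class 3, which is how the cells of Table 2 are indexed.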

Table 3. Overall accuracy comparison on Brisbane flood risk map data (%) (C. Flood: Creek flood, R. Flood: River flood, S. Flood: Storm tide flood, O. Flood: Overland flow flood; LLaMA3.2: LLaMA3.2-1B-Instruct, Qwen3: Qwen3-1.7B-Instruct).

Flood Type	Classical ML		DL Approaches		LLM Approaches		Flood-LLM
	SVM	RF	MLP	GCN	LLaMA3.2	Qwen3	+LLaMA3.2	+Qwen3
C. Flood	$66.74 \pm 3.17$	$68.70 \pm 3.80$	$72.32 \pm 4.54$	$80.70 \pm 4.07$	$4.84 \pm 0.00$	$5.70 \pm 0.00$	$88.80 \pm 3.55$	$90.40 \pm 3.64$
R. Flood	$56.11 \pm 4.51$	$60.86 \pm 4.06$	$60.15 \pm 4.68$	$83.59 \pm 4.35$	$5.10 \pm 0.01$	$6.15 \pm 0.00$	$90.75 \pm 4.65$	$88.55 \pm 4.86$
S. Flood	$73.41 \pm 5.28$	$78.55 \pm 5.94$	$75.33 \pm 5.43$	$80.75 \pm 5.61$	$3.45 \pm 0.00$	$7.59 \pm 0.00$	$91.75 \pm 5.34$	$95.45 \pm 6.40$
O. Flood	$53.15 \pm 2.06$	$55.95 \pm 1.85$	$55.08 \pm 1.00$	$76.48 \pm 3.55$	$2.05 \pm 0.00$	$6.04 \pm 0.00$	$80.55 \pm 2.36$	$87.60 \pm 2.40$

Table 4. Valid output ratio and conditional accuracy given valid outputs of base LLM approaches (Val.Out.: Valid Output ratio, Con.Acc.: Conditional Accuracy given valid outputs).

Flood Type	Base LLaMA3.2		Base Qwen3
	Val.Out.	Con.Acc.	Val.Out.	Con.Acc.
C. Flood	$16.01$	$30.23$	$19.01$	$29.84$
R. Flood	$16.11$	$31.64$	$20.03$	$30.71$
S. Flood	$10.46$	$32.98$	$25.17$	$30.15$
O. Flood	$8.14$	$25.17$	$20.11$	$30.04$
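
The two columns of Table 4 can be related to the base-LLM accuracies in Table 3 with a short calculation. The sketch below assumes that an invalid output is scored as an error, so that overall accuracy equals the valid output ratio multiplied by the conditional accuracy. This relation is an assumption inferred from the tables, not a formula stated with them. The numbers are copied from Tables 3 and 4, and all names are hypothetical.

```python
# (flood type, base model) -> (valid output ratio %, conditional accuracy %), from Table 4
TABLE4 = {
    ("C. Flood", "LLaMA3.2"): (16.01, 30.23),
    ("C. Flood", "Qwen3"): (19.01, 29.84),
    ("R. Flood", "LLaMA3.2"): (16.11, 31.64),
    ("R. Flood", "Qwen3"): (20.03, 30.71),
    ("S. Flood", "LLaMA3.2"): (10.46, 32.98),
    ("S. Flood", "Qwen3"): (25.17, 30.15),
    ("O. Flood", "LLaMA3.2"): (8.14, 25.17),
    ("O. Flood", "Qwen3"): (20.11, 30.04),
}

# (flood type, base model) -> overall accuracy % of the untuned model, from Table 3
TABLE3_BASE = {
    ("C. Flood", "LLaMA3.2"): 4.84,
    ("C. Flood", "Qwen3"): 5.70,
    ("R. Flood", "LLaMA3.2"): 5.10,
    ("R. Flood", "Qwen3"): 6.15,
    ("S. Flood", "LLaMA3.2"): 3.45,
    ("S. Flood", "Qwen3"): 7.59,
    ("O. Flood", "LLaMA3.2"): 2.05,
    ("O. Flood", "Qwen3"): 6.04,
}


def implied_overall_accuracy(valid_ratio_pct: float, conditional_acc_pct: float) -> float:
    """Assumption: invalid outputs count as wrong, so the two rates multiply."""
    return valid_ratio_pct * conditional_acc_pct / 100.0


def max_gap() -> float:
    """Largest absolute gap (percentage points) between the product and Table 3."""
    return max(
        abs(implied_overall_accuracy(*TABLE4[key]) - TABLE3_BASE[key])
        for key in TABLE4
    )


for key, (valid, cond) in TABLE4.items():
    implied = implied_overall_accuracy(valid, cond)
    print(key, round(implied, 2), TABLE3_BASE[key])
```

Under this assumption the products agree with the Table 3 entries to within about 0.03 percentage points. Read this way, the low overall accuracy of the base models reflects both a small share of valid outputs (8% to 25%) and a conditional accuracy of roughly 25% to 33%.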

Table 5. Affected property area (A.P.A.) and level accuracy (L.Acc) comparison between two prediction results and the council flood risk map.

Flood Type	Level	Flood-LLaMA		Flood-Qwen		Ground Truth
		A.P.A. (km²)	L.Acc (%)	A.P.A. (km²)	L.Acc (%)	A.P.A. (km²)
C. Flood	Level 0	$713.62$	$90.15$	$709.66$	$92.24$	$739.08$
	Level 1	$76.10$	$84.74$	$57.74$	$89.09$	$23.77$
	Level 2	$59.86$	$83.18$	$58.43$	$88.18$	$11.96$
	Level 3	$116.77$	$85.53$	$140.53$	$89.57$	$191.54$
R. Flood	Level 0	$695.81$	$95.03$	$696.40$	$90.74$	$707.56$
	Level 1	$89.98$	$84.28$	$79.38$	$88.55$	$58.65$
	Level 2	$81.89$	$84.11$	$74.28$	$85.52$	$44.68$
	Level 3	$98.67$	$85.11$	$116.29$	$85.19$	$155.46$
S. Flood	Level 0	$741.43$	$94.62$	$751.49$	$95.65$	$763.24$
	Level 1	$83.30$	$86.77$	$50.06$	$87.59$	$18.16$
	Level 2	$70.41$	$85.78$	$38.40$	$91.53$	$5.92$
	Level 3	$71.21$	$85.60$	$126.39$	$93.04$	$179.03$
O. Flood	Level 0	$393.21$	$84.39$	$407.09$	$88.83$	$432.94$
	Level 1	$119.66$	$74.63$	$95.82$	$73.52$	$41.94$
	Level 2	$194.92$	$80.55$	$180.89$	$85.37$	$161.50$
	Level 3	$258.56$	$73.89$	$282.55$	$73.15$	$329.97$
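
To make the area comparison in Table 5 easier to read, the sketch below computes the signed difference between predicted and council-mapped affected property area for each level, using the creek flood rows as a worked example. The values are copied from Table 5. The helper names and the choice of a signed difference as the summary are illustrative assumptions, not a metric defined by the authors.

```python
from typing import List

LEVELS = ["Level 0", "Level 1", "Level 2", "Level 3"]

# Affected property area (km^2) for creek flood (C. Flood), from Table 5.
GROUND_TRUTH = [739.08, 23.77, 11.96, 191.54]
FLOOD_LLAMA = [713.62, 76.10, 59.86, 116.77]
FLOOD_QWEN = [709.66, 57.74, 58.43, 140.53]


def signed_area_error(predicted: List[float], truth: List[float]) -> List[float]:
    """Predicted minus ground-truth area per level; positive means over-assignment."""
    return [round(p - t, 2) for p, t in zip(predicted, truth)]


def total_area(areas: List[float]) -> float:
    """Total area across all levels, rounded to the precision of Table 5."""
    return round(sum(areas), 2)


for name, areas in [("Flood-LLaMA", FLOOD_LLAMA), ("Flood-Qwen", FLOOD_QWEN)]:
    errors = signed_area_error(areas, GROUND_TRUTH)
    for level, err in zip(LEVELS, errors):
        print(f"{name} {level}: {err:+.2f} km^2")
    print(f"{name} total: {total_area(areas)} km^2")
```

For creek flooding, both models assign more area to Levels 1 and 2 and less to Levels 0 and 3 than the council map, while the totals across levels agree with the ground truth to within 0.01 km².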

Table 6. Ablation study on Brisbane flood risk map data (%) (-Geospatial: Without Geospatial context, -Hydrological: Without Hydrological context, -Infrastructural: Without Infrastructural context, -Historical: Without Historical context; C. Flood: Creek flood, R. Flood: River flood, S. Flood: Storm tide flood, O. Flood: Overland flow flood; LLaMA3.2: LLaMA3.2-1B-Instruct, Qwen3: Qwen3-1.7B-Instruct).

Flood Type	Flood-LLM		-Geospatial		-Hydrological		-Infrastructural		-Historical
	LLaMA3.2	Qwen3	LLaMA3.2	Qwen3	LLaMA3.2	Qwen3	LLaMA3.2	Qwen3	LLaMA3.2	Qwen3
C. Flood	$88.80$	$90.40$	$87.71$	$89.29$	$84.93$	$86.76$	$87.36$	$89.33$	$87.28$	$88.77$
R. Flood	$90.75$	$88.55$	$89.32$	$88.27$	$87.29$	$85.72$	$89.78$	$88.66$	$89.01$	$88.54$
S. Flood	$91.75$	$95.45$	$90.23$	$94.01$	$86.64$	$90.28$	$90.51$	$94.29$	$90.68$	$94.33$
O. Flood	$80.55$	$87.60$	$79.28$	$86.55$	$79.25$	$86.46$	$73.81$	$81.06$	$78.48$	$85.61$
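
The ablation results in Table 6 can be summarized as accuracy drops relative to the full model. The sketch below copies the values from Table 6 and reports, for each flood type and backbone, which removed context costs the most accuracy. The function names are hypothetical, and ranking contexts by the size of the drop is an illustrative reading of the table, not an analysis procedure attributed to the authors.

```python
from typing import Dict, Tuple

CONTEXTS = ["Geospatial", "Hydrological", "Infrastructural", "Historical"]

# (flood type, backbone) -> (full accuracy %, [accuracy % without each context,
# in CONTEXTS order]); all values from Table 6.
TABLE6 = {
    ("C. Flood", "LLaMA3.2"): (88.80, [87.71, 84.93, 87.36, 87.28]),
    ("C. Flood", "Qwen3"): (90.40, [89.29, 86.76, 89.33, 88.77]),
    ("R. Flood", "LLaMA3.2"): (90.75, [89.32, 87.29, 89.78, 89.01]),
    ("R. Flood", "Qwen3"): (88.55, [88.27, 85.72, 88.66, 88.54]),
    ("S. Flood", "LLaMA3.2"): (91.75, [90.23, 86.64, 90.51, 90.68]),
    ("S. Flood", "Qwen3"): (95.45, [94.01, 90.28, 94.29, 94.33]),
    ("O. Flood", "LLaMA3.2"): (80.55, [79.28, 79.25, 73.81, 78.48]),
    ("O. Flood", "Qwen3"): (87.60, [86.55, 86.46, 81.06, 85.61]),
}


def accuracy_drops(key: Tuple[str, str]) -> Dict[str, float]:
    """Full-model accuracy minus ablated accuracy, in percentage points."""
    full, ablated = TABLE6[key]
    return {ctx: round(full - acc, 2) for ctx, acc in zip(CONTEXTS, ablated)}


def most_influential(key: Tuple[str, str]) -> str:
    """Context whose removal causes the largest accuracy drop."""
    drops = accuracy_drops(key)
    return max(drops, key=drops.get)


for key in TABLE6:
    print(key, most_influential(key), accuracy_drops(key))
```

On these numbers, removing hydrological context produces the largest drop for creek, river, and storm tide flooding with both backbones, while removing infrastructural context produces the largest drop for overland flow (6.74 points for LLaMA3.2 and 6.54 points for Qwen3).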

Table 7. Input features for a representative target property example (ID 13 * 46, with * masking digits for privacy) and three of its neighboring properties. “Nbr” abbreviates “Neighbor”. (For clarity, auxiliary information such as coordinate values is omitted, and all floating-point values are rounded to three decimal places in this table.)

Attributes	Property 13 * 46 (Target)	Property 13 * 45 (Nbr 1)	Property 13 * 57 (Nbr 2)	Property 13 * 58 (Nbr 3)
Elevation	$7.133$ m AHD	$7.133$ m AHD	$7.133$ m AHD	$7.133$ m AHD
Coastline Distance	$20.392$ km	$20.371$ km	$20.391$ km	$20.401$ km
Record 1974	Flooded	Flooded	Flooded	Flooded
Record 2011	Flooded	Flooded	Flooded	Flooded
Record 2022	Flooded	Flooded	No Exposure	No Exposure
Waterways (in 100 m)	None	None	None	None
Gullies (in 100 m)	None	None	{ID P120 * 290 ⋯}	{ID P120 * 290 ⋯}
Pipes (in 100 m)	None	None	{ID P * 053 ⋯}	{ID P120 * 053 ⋯}
Neighbors (Adjacent)	{13 * 45, ⋯, 13 * 58}	{13 * 46, ⋯, 13 * 59}	{13 * 46, ⋯, 13 * 58}	{13 * 46, ⋯, 13 * 59}
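
To illustrate how attributes such as those in Table 7 might be turned into a structured textual representation for a language model, the sketch below serializes the target property's row into labeled lines. The `PropertyRecord` class, the `to_text` function, the field names, and the line template are all hypothetical; the prompt format actually used by Flood-LLM is not reproduced here. Only values shown in Table 7 are used, and the masked identifier is kept as printed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PropertyRecord:
    """Hypothetical container for one column of Table 7."""
    property_id: str
    elevation_m_ahd: float
    coastline_distance_km: float
    flood_records: Dict[int, str]  # year -> "Flooded" or "No Exposure"
    waterways_100m: str
    gullies_100m: str
    pipes_100m: str


def to_text(record: PropertyRecord) -> str:
    """Render the record as labeled lines (illustrative template only)."""
    lines = [
        f"Property {record.property_id}",
        f"Elevation: {record.elevation_m_ahd:.3f} m AHD",
        f"Coastline distance: {record.coastline_distance_km:.3f} km",
    ]
    for year in sorted(record.flood_records):
        lines.append(f"Flood record {year}: {record.flood_records[year]}")
    lines.append(f"Waterways within 100 m: {record.waterways_100m}")
    lines.append(f"Gullies within 100 m: {record.gullies_100m}")
    lines.append(f"Pipes within 100 m: {record.pipes_100m}")
    return "\n".join(lines)


# Target property from Table 7 (ID digits masked in the source).
target = PropertyRecord(
    property_id="13 * 46",
    elevation_m_ahd=7.133,
    coastline_distance_km=20.392,
    flood_records={1974: "Flooded", 2011: "Flooded", 2022: "Flooded"},
    waterways_100m="None",
    gullies_100m="None",
    pipes_100m="None",
)

print(to_text(target))
```

Keeping one attribute per line with an explicit unit makes it straightforward to append neighbor descriptions in the same format, although the neighbor lists in Table 7 are abbreviated and are therefore left out of this sketch.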


