1. Introduction
Floods remain one of the most disruptive and recurrent natural hazards worldwide, with increasing frequency and severity driven by climate change, rapid urbanization, and aging drainage infrastructure. In response, flood management has shifted from a focus on structural protection toward a more integrated flood risk management (FRM) paradigm that emphasizes vulnerability reduction, informed decision-making, and community resilience [
1,
2]. Within this paradigm, flood risk mapping is an essential tool for development guidance, preparedness planning, and public communication, and is increasingly regarded as a key instrument for supporting sustainable and climate-resilient urban development. Unlike traditional inundation mapping that describes physical characteristics such as extent, depth, and duration [
3], flood risk mapping synthesizes modeled flood behavior with contextual information, including historical flood records, surface and subsurface interactions, infrastructure conditions, and expert judgment to produce categorical planning-oriented risk levels [
4,
5]. Importantly, these outputs are not direct products of raw hydrodynamic simulations. Rather, they represent interpreted regulatory layers in which simulation results are integrated with professional judgment, local contextual knowledge, and policy considerations for urban planning.
Producing flood risk maps, however, is inherently challenging. Urban flooding results from the complex interaction of multiple domains, including topography, rainfall–runoff dynamics, drainage-network performance, and soil permeability [
6]. As a result, operational workflows in regions such as Australia and Europe depend on computationally intensive hydrodynamic simulations, expert calibration, and multiple rounds of validation [
7,
8]. These methods provide effective, empirically grounded results but remain resource-demanding, difficult to scale, and heavily dependent on intra- and inter-domain expertise. Critically, such workflows often struggle to keep pace with changes in land use, surface characteristics, and infrastructure configuration caused by rapid urban densification, expansion, or redevelopment. Empirical models such as Random Forest, CatBoost, CNNs, or GNNs have been increasingly adopted to complement or accelerate such tasks [
9,
10,
11]. Yet, despite progress, existing data-driven approaches continue to face three major limitations.
First, multi-domain heterogeneity remains difficult to model. Flood risk assessment depends simultaneously on geographical, hydrological, infrastructural, and historical datasets that often use incompatible formats and scales. Conventional vector-based machine learning (ML) and deep learning (DL) models struggle to integrate these heterogeneous inputs coherently [
12,
13]. Second, spatial neighborhoods, which integrate micro-, meso-, and macro-scale factors, are difficult to model. The relevant contextual area for assessing a property’s flood risk may extend far beyond immediate adjacency, while the combined effect of multiple granular interventions varies significantly across locations [
14]. Models that either truncate the neighborhood or treat it uniformly risk losing important local cues related to micro-topography, upstream flow paths, or drainage-network connectivity. Third, interpretability remains limited. Many state-of-the-art models operate as black boxes, offering little transparency regarding how input factors contribute to predicted risk levels [
15]. This lack of explainability constrains their reliability for decision-makers, practitioners, and the public. A summary of representative studies in this area is provided in
Table 1.
These persistent limitations motivate the exploration of complementary analytical methods, such as large language models (LLMs), which open new directions for flood risk mapping. LLMs offer advantages that are not available in traditional ML/DL architectures: they can integrate heterogeneous information expressed in natural language, reason across domains, and generate interpretable chains of thought. Recent research has shown that LLM-based systems can emulate expert reasoning patterns in hydrology and disaster assessment [
27], suggesting their potential to support risk-mapping workflows that require extensive contextualization.
To address the challenges outlined above, we have developed Flood-LLM, a multi-agent LLM-based framework designed to construct representations of diverse geospatial inputs, summarize spatial neighborhoods, and learn from expert-interpreted planning maps. Rather than modeling physical flood dynamics, the Flood-LLM framework explores how an LLM-based approach can use heterogeneous spatial, environmental, and infrastructural indicators to support flood-risk classifications for planning maps. In this sense, our aim is to support the development of complementary data-driven methods through an integrated framework whose potential is assessed by validation against the label assignment patterns of official flood maps. Accordingly, this paper addresses the following research questions: RQ1: Can LLMs integrate heterogeneous geospatial, hydrological, drainage-network, and historical flood data to approximate the property-level flood risk classifications used in planning maps? RQ2: How can LLM-based systems incorporate diverse and uneven spatial neighborhoods when learning such mappings? RQ3: To what extent can LLM-generated outputs provide interpretable insights into how these planning classifications relate to heterogeneous urban indicators?
To answer these questions, Flood-LLM introduces three coordinated agents: a Relevant Info Agent that gathers and narrates multi-domain inputs; a Div-max Neighbor Info Agent that extracts semantically diverse neighborhood contexts; and a Learnable Estimation Agent fine-tuned via LoRA (low-rank adaptation) to approximate expert-informed planning labels (see
Figure 1). The framework enables cross-domain reasoning without requiring explicit vectorization and generates interpretable reasoning chains alongside risk predictions.
To clarify the intended scope, Flood-LLM is neither an alternative to hydrodynamic simulation nor a tool for direct statutory or regulatory decision-making. Instead, it is introduced as an exploratory methodological framework for deploying large language models within complex mapping processes. By integrating heterogeneous spatial datasets, the framework facilitates the generation of labels that approximate expert-interpreted planning classifications. It is conceived as an exploratory investigation into whether such an approach could serve as a multi-level support tool for practical, effective, and rapid flood risk analyses, which are particularly needed where full physical modeling is constrained by insufficient data availability, processing time, or expert and computational resources. By operationalizing different layers of the flood-risk knowledge system as proxy indicators, Flood-LLM provides planning-oriented risk indications more rapidly than conventional workflows.
This learning paradigm necessarily conditions the model’s behavior on the characteristics of the planning classifications on which it is trained, and therefore may also reflect the uncertainties and limitations inherent in those underlying data sources. However, it also reveals a positive implication for transferability, as it suggests the potential for adaptation to different spatial contexts through retraining on locally defined classifications. Crucially, if approaches of this kind were ever to be considered for practical use in the future, such adaptation would require careful attention to issues of expert oversight, ethical governance, and accountability.
The contribution of this study, therefore, lies in exploring the potential of a new technological approach for assisting early-stage assessment, rapid updating, and data-constrained analysis. Specifically, the main contributions of this paper are as follows: (1) we explore the feasibility of an LLM-based framework for approximating property-level flood risk classifications from heterogeneous multi-source urban data; (2) we design a diversity-based neighborhood information mechanism to examine how spatial dependencies among properties can be incorporated into such a framework while remaining computationally tractable; (3) we investigate whether Flood-LLM can reproduce high-resolution planning-oriented flood risk labels that are consistent with expert-reviewed classifications; and (4) we conduct a systematic empirical evaluation using real-world datasets from a salient case, urban Brisbane, including ablation studies, neighborhood parameter analysis, and comparative benchmarking to assess the behavior and limitations of the approach.
The remainder of this paper is organized as follows.
Section 2 introduces the study area, data, task formulation, and the proposed Flood-LLM framework, together with implementation details.
Section 3 presents the experimental results, including comparisons with classical ML, DL, and LLM baselines, quantitative and visual evaluations against the Council flood maps, and additional analyses such as an ablation study, parameter sensitivity, and illustrative reasoning cases.
Section 4 discusses the implications, opportunities, and limitations of LLM-based flood risk assessment.
Section 5 concludes the paper and outlines future research directions.
2. Materials and Methods
This section describes the study area, data sources and preprocessing steps, the formulation of the flood risk estimation task, and the proposed Flood-LLM framework, followed by the experimental design.
2.1. Study Area
The empirical study focuses on the Brisbane metropolitan area, a major coastal city in Queensland, Australia, that is highly exposed to multiple flood hazards. Brisbane is affected by riverine flooding along the Brisbane River and its tributaries, local creek flooding in smaller catchments, storm tide flooding in low-lying coastal areas, and intense rainfall generating overland-flow flooding across the urban fabric. The city has experienced several major flood events in recent decades, including those in 1974, 2011, and 2022, which have led to recurrent updates of flood awareness policies and maps by Brisbane City Council (BCC). This context provides a rich and policy-relevant testbed for exploring property-level flood risk estimation.
2.2. Data Sources and Preprocessing
This study integrates multiple geospatial datasets to derive property-level flood risk indicators and corresponding supervision signals for Brisbane.
Generally, property parcels, topography, hydrological systems, drainage infrastructure, and historical flood extents are combined into structured property-level feature representations used as model inputs. Property parcels serve as the fundamental spatial unit for analysis, with neighborhoods defined through parcel intersections that provide a spatial scaffold for aggregating contextual information. Within this framework, hydrological and topographic data approximate upstream–downstream influences, drainage infrastructure captures functional stormwater connectivity, and historical flood extents encode empirical exposure patterns, enabling neighborhood interactions to reflect not only geometric proximity but also hydrologically informed relationships (see
Figure 1).
In addition, flood risk labels are derived from the BCC’s Flood Awareness Maps, which serve as the official reference for planning and flood risk communication and are used in this study to construct property-level flood risk representations as model targets. These maps synthesize Council-endorsed flood studies and modeling outputs to delineate the extents and likelihoods of four flood types, including creek, river, storm tide, and overland flow flooding, with risk layers derived from approved hydrodynamic and overland flow models. Property parcels serve as the fundamental spatial units to which flood risk labels are assigned. Each parcel is assigned a four-component flood risk label, which provides the ground truth for model training and performance evaluation.
2.2.1. Property Parcels and Elevation
The primary spatial dataset is the BCC Property Parcel dataset, provided in Shapefile vector format. The dataset represents individual land parcels as polygon geometries, where each polygon is defined by an ordered set of planar coordinates forming a closed shape that delineates the legally defined boundary of a property. Each polygon is linked to an attribute table containing descriptive information for the corresponding parcel, including cadastral identifiers, address details, land tenure, and lot area. The shapefile is referenced to a local projected grid system, the Map Grid of Australia Zone 56 (MGA56), based on the Geocentric Datum of Australia 1994 (GDA94), which enables accurate spatial measurement and integration with other spatial layers (all coordinates used in this study follow this coordinate system). In total, the dataset comprises 519,009 property parcels, providing a city-wide, parcel-level representation of Brisbane’s land subdivision structure (see
Figure 2).
Elevation information in this study was obtained from the Brisbane Contour Map dataset accessed via the Queensland Spatial Catalogue. The source data are provided as vector contour lines, where each polyline represents a line of constant elevation above sea level and carries an elevation attribute. To enable property-level elevation attribution, the contour data were processed in ArcGIS Pro 3.4 to generate a raster elevation surface, in which elevation is represented as a grid of regularly spaced cells storing continuous height values. This rasterization step allows elevation values to be queried at specific point locations. The resulting raster dataset is referenced to the same coordinate framework as the property parcel data, ensuring spatial consistency for subsequent analysis (see
Figure 3a). Thus, the elevation of each property’s centroid was used as its assigned elevation value.
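As a rough illustration of how a rasterized elevation surface can be queried at a centroid, the following minimal sketch looks up the grid cell containing a point. It assumes a north-up grid with square cells; the function name, the grid values, and the raster origin are hypothetical and do not reproduce the ArcGIS Pro workflow used in this study.

```python
import math
from typing import Sequence


def sample_elevation(grid: Sequence[Sequence[float]],
                     x_min: float, y_max: float, cell_size: float,
                     x: float, y: float) -> float:
    """Return the value of the raster cell that contains point (x, y).

    Assumption: row 0 is the northernmost row and (x_min, y_max) is the
    upper-left corner of the raster, in the same projected units as (x, y).
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError("point lies outside the raster extent")
    return grid[row][col]


# Hypothetical 2 x 2 raster with 10 m cells (values in metres).
grid = [[12.0, 13.5],
        [10.0, 11.0]]
centroid_elevation = sample_elevation(grid, x_min=0.0, y_max=20.0,
                                      cell_size=10.0, x=15.0, y=5.0)
print(centroid_elevation)
```

A nearest-cell lookup is the simplest choice; interpolating between neighboring cells would be an alternative where smoother values are needed.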
In practice, property centroid coordinates are used as the input for all models, including our Flood-LLM and all baselines. To clarify the notation used throughout this paper, we let $p$ represent a single property and define the following:
$$\mathcal{P} = \{p_i\}_{i=1}^{N},$$
the complete set of properties under consideration, where $N$ denotes the total number of properties and $i$ indexes the $i$-th property. For each property $p_i$, the centroid coordinates $\mathbf{c}_i$ are used as AI model inputs.
Specifically, while coordinate values can be used directly by classical ML and DL models, LLM-based approaches require converting the numerical inputs into text. In practice, this conversion is performed with Python 3.10’s built-in str() function (similar string-conversion functions are widely available in programming languages and are commonly used to transform numerical or structured data into text; the specific implementation details are omitted here as they do not affect the discussion).
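The conversion can be illustrated with a minimal sketch. The sentence template, the function name, and the coordinate values below are hypothetical; only the use of str() is taken from the text above.

```python
def centroid_to_text(x: float, y: float, elevation: float) -> str:
    """Render numeric centroid attributes as a short text fragment via str()."""
    return (
        "Centroid easting " + str(x) + " m, northing " + str(y)
        + " m, elevation " + str(elevation) + " m."
    )


# Made-up coordinates in a projected (metre-based) grid.
text = centroid_to_text(502345.12, 6961234.5, 14.2)
print(text)
```

Because str() returns the shortest representation that round-trips the float, the text keeps the stored precision without any extra formatting logic.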
2.2.2. Hydrological and Drainage Infrastructure Data
To characterize Brisbane’s surface and subsurface drainage systems, several additional datasets, sourced from the BCC Open Data Portal in Shapefile format, are incorporated, all referenced to the same projected coordinate system for seamless overlay and network analysis. The waterway network dataset comprises 18,227 polyline features, where each polyline is defined by an ordered sequence of coordinates representing the centreline of a natural or semi-natural watercourse (See
Figure 3b). These linear geometries capture the spatial paths of creeks, streams, and drainage channels with either intermittent or permanent surface flow.
The gully dataset contains 167,970 point features that indicate the locations of stormwater inlets where surface runoff enters the drainage system, each point defined by planar coordinates. The drainage pipe dataset consists of 280,229 polyline features representing the spatial alignment of underground stormwater pipes, with each polyline tracing the path of a subsurface conduit (See
Figure 4).
The unstructured geometries and heterogeneous formats of waterways, gullies, and drainage pipes make them difficult to encode using standard feature representations. To preserve this information for subsequent LLM queries, we store it using a Polygon–Text data structure. Let $\mathcal{H}$ denote the complete set of hydrological features, including waterways, gullies, and pipes, and let $h \in \mathcal{H}$ denote an individual entity. Within a geopandas framework, each such entity is represented as a polygon, defined by a sequence of points outlining its boundary. A specific polygon can be mapped to a set of coordinates, where each point is represented by a 3-dimensional coordinate vector. Thus, for entity $h$, we define the following:
$$g_h \in (\mathbb{R}^{3})^{+},$$
where $(\cdot)^{+}$ denotes the positive closure over $\mathbb{R}^{3}$, representing a finite but unbounded sequence of 3-dimensional coordinate vectors. This allows the representation of polygons with any number of vertices. Additional metadata for entity $h$ is stored as follows:
$$t_h \in \mathcal{V}^{+}.$$
Here, $\mathcal{V}$ denotes the vocabulary of text tokens and $\mathcal{V}^{+}$ denotes a sequence of textual tokens. The content of $t_h$ varies with the corresponding $h$. For a waterway, $t_h$ may include fields such as name and water type, whereas for a pipe, it may contain width, material, and construction date. The length of $t_h$ is variable, depending on the richness of the published data from the BCC.
The variable nature of both $g_h$ and $t_h$ renders them incompatible with conventional machine learning or deep learning models, which typically require fixed-dimension vector inputs. For LLM utilization, we design an agentic framework comprising an LLM and specialized tools (detailed in Section 2.4.1) to dynamically retrieve and incorporate relevant information into the model’s input context.
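One possible way to hold such a record in memory is sketched below: a variable-length list of 3-dimensional vertices paired with a free-form text field. The class name, the helper function, and the example attribute names are assumptions made for illustration; the study itself stores these data within geopandas.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Coord3D = Tuple[float, float, float]


@dataclass
class PolygonText:
    """Hypothetical Polygon-Text record: geometry plus textual metadata."""
    geometry: List[Coord3D]  # non-empty, variable-length vertex sequence
    text: str                # variable-length metadata rendered as text

    def __post_init__(self) -> None:
        # The positive closure excludes the empty sequence.
        if not self.geometry:
            raise ValueError("geometry must contain at least one vertex")


def metadata_to_text(fields: Dict[str, Optional[object]]) -> str:
    """Join the available attribute fields, skipping missing (None) values."""
    return "; ".join(f"{k}: {v}" for k, v in fields.items() if v is not None)


# Illustrative pipe record; the attribute names are made up.
pipe = PolygonText(
    geometry=[(0.0, 0.0, 5.0), (30.0, 0.0, 4.5)],
    text=metadata_to_text({"type": "pipe", "width_mm": 600,
                           "material": "concrete",
                           "construction_date": None}),
)
print(pipe.text)
```

Skipping missing fields keeps the text short when the published attributes are sparse, which is one reason the text length varies from entity to entity.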
2.2.3. Historical Flood Records
Historical flood information was also obtained from BCC’s open data portal in shapefile format, which provides mapped flood inundation extent datasets documenting the observed spatial footprint of flooding for three major events in 1974, 2011, and 2022 (See
Figure 5). Each flood event layer consists of a set of polygon features whose geometries delineate the mapped boundaries of areas that were underwater during that event. The 1974 flood record is represented by a single polygon capturing the overall inundation extent, whereas the 2011 and 2022 datasets comprise 802 and 776 polygons, respectively, reflecting a more spatially detailed delineation of inundated areas.
In the implementation, the historical record is represented as vectors, defined by the following:
2.2.4. Flood Risk Labels
Flood risk labels are derived from BCC’s Flood Awareness Maps, which are planning-oriented classification products that integrate modeling outputs, historical studies, and regulatory interpretation into discrete exposure zones intended for public communication and development assessment.
Specifically, the maps delineate exposure zones for four flood types: creek, river, storm tide, and overland flow. For each type, the Flood Awareness Maps are provided as a separate Shapefile-based vector dataset, in which a collection of polygon features represents flood exposure. Each polygon delineates a geographic zone of uniform flood risk for a given flood type and is accompanied by attribute information specifying the flood type and its associated risk level. In terms of data volume, the creek flood dataset contains 45,546 polygons, the river flood dataset 5308 polygons, the storm tide dataset 34,808 polygons, and the overland-flow dataset 1,246,568 polygons, indicating substantial variation in the spatial granularity with which different flood processes are mapped.
Across all flood types, BCC derives risk levels from flood study modeling outputs associated with annual exceedance probability (AEP) scenarios, which are then translated through expert interpretation into discrete planning categories of likelihood or impact. Specifically, for creek, river, and storm tide floods, risk levels are expressed as AEP categories: high (5%), medium (1%), low (0.2%), and very low (0.05%). For overland-flow flooding, risk is defined in terms of impact: high (5%), medium (2%), and low (1%).
In this study, the “low” and “very low” categories are merged for consistency across flood types. This merging reflects how these strata are operationalized in practice. Although the Flood Awareness Maps distinguish between “low” and “very low”, both categories are consolidated into a single 0.2% AEP designation in the Council’s “FloodWise Property Report”, which is the primary parcel-level reporting instrument supporting development assessment in Brisbane. Similarly, the Council’s public guidance document “Flooding in Brisbane: A Guide for Residents” [
28] issues identical recommendations for residents in the two strata, indicating functional equivalence for household preparedness and residential decision-making. Therefore, merging the two categories aligns the experimental taxonomy with their real-world usage and preserves the ordinal structure of flood severity (high > medium > low) across all flood types.
This practical alignment, however, reduces label granularity and may diminish the model’s sensitivity to marginal flood risk, particularly where subtle variations occur at the urban fringe or in redeveloping areas, potentially limiting its ability to capture such transitional patterns.
In addition, it is important to recognize that the flood risk labels carry inherent uncertainties and limitations because the underlying map layers were developed from separate modeling studies conducted at different times and are neither synchronously nor continuously updated. For example, some statistical likelihood layers originate from earlier studies, such as the official creek and storm tide layers from 2012 and 2013 and the overland flow layer from 2017, while historical flood extents, including those associated with the 2022 Brisbane River crest, are based on event-specific observations. Rather than being refreshed through regular physical modeling cycles, the map is updated on a project basis when new flood study results become available, as illustrated by the 2025 revision, which updated the creek flood mapping for over 17,000 properties. These variations in data sources and update cycles may influence Flood-LLM by introducing inconsistencies that become embedded in the learned pattern, and therefore should be considered when interpreting the model’s behavior.
All risk levels are encoded numerically in this study as follows: 0 (no exposure), 1 (low), 2 (medium), and 3 (high), following the rule defined by BCC [29,30,31,32]. To assign these flood risk labels to individual properties, a spatial join was conducted in ArcGIS (a spatial join is a location-based matching process in which attributes from one spatial layer are transferred to another based on their geometric relationship). As all spatial layers share the same coordinate system, property parcel polygons and flood risk polygons can be directly overlaid without coordinate transformation.
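The numeric encoding, including the merging of the “low” and “very low” categories, can be sketched as a simple lookup. The attribute strings used as keys are assumptions about how the categories might be spelled in the source layers; only the numeric codes come from the text.

```python
from typing import Optional

# Numeric codes from the text; "very low" is merged into "low".
RISK_CODES = {"very low": 1, "low": 1, "medium": 2, "high": 3}


def encode_risk(level: Optional[str]) -> int:
    """Map a categorical risk level to its numeric code (None means no exposure)."""
    if level is None:
        return 0
    key = level.strip().lower()
    if key not in RISK_CODES:
        raise ValueError(f"unknown risk level: {level!r}")
    return RISK_CODES[key]


print([encode_risk(v) for v in (None, "Very Low", "Low", "Medium", "High")])
```

Raising an error on unknown strings, rather than silently returning 0, avoids confusing a misspelled category with genuine absence of exposure.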
The procedure was applied separately to each flood type by spatially joining the property parcel layer with the corresponding flood risk layer. For each flood type, a property was assigned the highest intersecting risk level if it overlapped with one or more flood risk polygons, and 0 (no exposure) if no intersection occurred. Repeating this process across the four flood types yields a four-dimensional flood risk vector for each property, with one risk level per flood type. The distribution of properties across risk levels for each flood type is summarized in
Figure 6.
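The assignment rule (highest intersecting level per flood type, 0 when nothing intersects) can be sketched as follows. The sketch starts from the numeric levels of the polygons that already intersect a parcel, so the spatial join itself is not reproduced; the ordering of flood types within the vector and all names are assumptions.

```python
from typing import Dict, List, Tuple

# Assumed ordering of the four components of the label vector.
FLOOD_TYPES = ("creek", "river", "storm_tide", "overland_flow")


def property_label(intersecting: Dict[str, List[int]]) -> Tuple[int, ...]:
    """Take the highest intersecting level per flood type, or 0 if none."""
    return tuple(max(intersecting.get(ft, []), default=0) for ft in FLOOD_TYPES)


# A parcel touching two creek polygons (levels 1 and 3) and one overland-flow
# polygon (level 2), with no river or storm tide intersection.
label = property_label({"creek": [1, 3], "overland_flow": [2]})
print(label)
```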
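For each property $p_i$, the corresponding four labels can be denoted by the following:
$$\mathbf{y}_i = \left(y_i^{\text{creek}},\, y_i^{\text{river}},\, y_i^{\text{tide}},\, y_i^{\text{overland}}\right) \in \{0, 1, 2, 3\}^{4}.$$
Then, the overall dataset can be defined as follows:
$$\mathcal{D} = \{(p_i, \mathbf{y}_i)\}_{i=1}^{N}.$$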
To ensure a strict separation between model training and evaluation, we adopt a spatial hold-out strategy [33] when using the flood risk labels. Brisbane comprises 190 suburbs, of which 38 are randomly selected as the training region, while the remaining 152 are reserved exclusively for testing.
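Let $\mathcal{S}_{\text{train}}$ denote the set of training suburbs and $\mathcal{S}_{\text{test}}$ denote the set of testing suburbs, such that
$$\mathcal{S}_{\text{train}} \cap \mathcal{S}_{\text{test}} = \emptyset, \quad |\mathcal{S}_{\text{train}}| = 38, \quad |\mathcal{S}_{\text{test}}| = 152.$$
Each property $p_i$ belongs to a suburb $s(p_i)$, and the training and testing sets of properties are defined as
$$\mathcal{P}_{\text{train}} = \{p_i \in \mathcal{P} \mid s(p_i) \in \mathcal{S}_{\text{train}}\}, \quad \mathcal{P}_{\text{test}} = \{p_i \in \mathcal{P} \mid s(p_i) \in \mathcal{S}_{\text{test}}\}.$$
Using expert-provided flood risk labels $\mathbf{y}_i$, the corresponding datasets are
$$\mathcal{D}_{\text{train}} = \{(p_i, \mathbf{y}_i) \mid p_i \in \mathcal{P}_{\text{train}}\}, \quad \mathcal{D}_{\text{test}} = \{(p_i, \mathbf{y}_i) \mid p_i \in \mathcal{P}_{\text{test}}\}.$$
Under this partition, among the $N$ properties in Brisbane, $|\mathcal{P}_{\text{train}}|$ properties are assigned to the training set and $|\mathcal{P}_{\text{test}}|$ properties to the test set. Properties located in the training suburbs provide labeled examples for model learning, whereas the flood risk labels in the test suburbs are retained exclusively as ground truth for evaluation.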
This design simulates realistic operational conditions for parcel-level flood risk mapping, where the physical environment of the entire city (e.g., terrain, drainage systems, and waterway geometry) is observable, but expert-reviewed flood risk labels are available only for a limited subset of locations. Under such conditions, the model can serve as a fast pre-estimation tool to provide preliminary risk assessments before comprehensive expert evaluation becomes available.
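A suburb-level hold-out of this kind can be sketched in a few lines. The suburb names, the property identifiers, and the random seed are placeholders; the paper does not report the seed or the sampling routine, so this is only one way such a split might be drawn.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple


def split_suburbs(suburbs: Iterable[str], n_train: int,
                  seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Randomly pick n_train suburbs for training; the rest are for testing."""
    ordered = sorted(suburbs)  # sort first so the draw is reproducible
    train = set(random.Random(seed).sample(ordered, n_train))
    return train, set(ordered) - train


def split_properties(property_suburb: Dict[str, str],
                     train_suburbs: Set[str]) -> Tuple[List[str], List[str]]:
    """Assign each property to train or test according to its suburb."""
    train_ids = [p for p, s in property_suburb.items() if s in train_suburbs]
    test_ids = [p for p, s in property_suburb.items() if s not in train_suburbs]
    return train_ids, test_ids


# Placeholder suburb names standing in for the 190 Brisbane suburbs.
suburbs = [f"suburb_{k:03d}" for k in range(190)]
train_s, test_s = split_suburbs(suburbs, n_train=38, seed=42)

# Two placeholder properties per suburb.
property_suburb = {f"prop_{k}": suburbs[k % 190] for k in range(380)}
train_ids, test_ids = split_properties(property_suburb, train_s)
print(len(train_s), len(test_s), len(train_ids), len(test_ids))
```

Splitting by suburb rather than by property keeps spatially adjacent parcels on the same side of the partition, which is what separates a spatial hold-out from a plain random split.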
2.3. Task Formulation
Flood risk maps are widely used as planning instruments for long-term disaster risk management and sustainable urban development. They are typically developed through workflows that combine hydrodynamic modeling with expert interpretation and local knowledge to assign flood risk classifications to individual properties. Updating these maps in response to changes in properties, infrastructure, or environmental conditions can be time-intensive, as the process often involves repeated modeling, data integration, and expert assessment. As a result, the production and updating of such maps may constrain the timely availability of planning-oriented risk information needed to support adaptive and sustainable planning decisions.
To explore the feasibility of emerging AI-based approaches in flood risk analysis, this study examines whether an LLM-based framework can learn the label assignment patterns embedded in expert-reviewed flood risk maps from heterogeneous urban and environmental indicators. It is presented purely as an exploration of new methodological possibilities that may, in the future, offer potential as preliminary or pre-screening support tools for more timely and sustainable flood risk management, subject to careful oversight, accountability, and ethical considerations. Here, we model expert decision-making as an unknown mapping function, denoted as
$f$, that satisfies for any $p_i \in \mathcal{P}$:
$$f(p_i) = \mathbf{y}_i,$$
which assigns each property a vector of flood risk levels across the considered flood types. In practice, the function $f$ is governed by expert knowledge and heuristics, and its explicit form is typically unknown. Only a limited set of property–risk level pairs provided by experts following this function is available.
To approximate this unknown mapping using machine learning techniques, we seek to learn a function $f_{\theta}$ parameterized by $\theta$:
$$f_{\theta}(p_i) = \hat{\mathbf{y}}_i.$$
Suppose there exists a $\theta^{*}$ such that the model $f_{\theta^{*}}$ most closely approximates the expert decision-making function $f$. Given the training data $\mathcal{D}_{\text{train}}$ used in the learning process, the estimation of $\theta^{*}$ can be formulated as the following optimization problem:
$$\theta^{*} = \arg\min_{\theta} \sum_{(p_i, \mathbf{y}_i) \in \mathcal{D}_{\text{train}}} \left\lVert f_{\theta}(p_i) - \mathbf{y}_i \right\rVert,$$
where $\lVert \cdot \rVert$ denotes the norm of the input. We can then obtain the output of the learned model $f_{\theta^{*}}$ on the test set $\mathcal{P}_{\text{test}}$, which denotes the properties for which expert-labeled flood levels are not available:
$$\hat{\mathbf{y}}_i = f_{\theta^{*}}(p_i), \quad p_i \in \mathcal{P}_{\text{test}}.$$
Using the predicted flood levels $\hat{\mathbf{y}}_i$ enables efficient estimation of expert-assessed flood risk levels for each property $p_i \in \mathcal{P}_{\text{test}}$, thereby facilitating the rapid generation of preliminary flood risk maps. Thus, we formulate the flood risk assessment task as the optimization of $\theta$. In Section 2.4, we show the technical implementation used to solve this optimization (Equation (37)).
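To make the optimization concrete, the toy sketch below searches a small set of candidate parameter values and keeps the one with the lowest summed error on labeled training pairs. Everything in it is an illustrative assumption: the one-parameter elevation rule stands in for the LLM, the L1 norm is chosen because the text does not specify the norm, and the data are made up. The actual framework fits its parameters by LoRA fine-tuning rather than enumeration.

```python
from typing import Dict, List, Sequence, Tuple

Label = Tuple[int, int, int, int]


def predict(theta: float, prop: Dict[str, float]) -> Label:
    """Toy model: flag high creek risk when elevation is below theta."""
    creek = 3 if prop["elev"] < theta else 0
    return (creek, 0, 0, 0)


def training_loss(theta: float,
                  data: Sequence[Tuple[Dict[str, float], Label]]) -> int:
    """Sum over training pairs of the L1 distance between prediction and label."""
    return sum(
        sum(abs(a - b) for a, b in zip(predict(theta, prop), label))
        for prop, label in data
    )


# Made-up training pairs: (property attributes, expert label vector).
train = [
    ({"elev": 2.0}, (3, 0, 0, 0)),
    ({"elev": 4.0}, (3, 0, 0, 0)),
    ({"elev": 9.0}, (0, 0, 0, 0)),
    ({"elev": 15.0}, (0, 0, 0, 0)),
]

candidates: List[float] = [1.0, 5.0, 10.0]
best_theta = min(candidates, key=lambda t: training_loss(t, train))
print(best_theta, [training_loss(t, train) for t in candidates])
```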
2.4. LLM Approach for Flood Risk Assessment
Considering the capacity of large language models to integrate heterogeneous information through textual reasoning, we develop an LLM-based framework for flood risk assessment, termed Flood-LLM. The framework comprises three key components. First, a Relevant Info Agent constructs a structured context of property-specific information from heterogeneous data sources. Second, a Div-max Neighbor Info Agent incorporates contextual information from surrounding properties to represent relevant neighborhood conditions. Finally, a Learnable Estimation Agent employs a training strategy that enables the model to learn expert-informed labeling patterns and provides a textual reasoning trace that supports interpretability of the estimation process.
2.4.1. Relevant Info Agent
Advances in information technology have enabled the widespread collection of digital representations of real-world properties by various organizations. These geospatial, hydrological, and infrastructural datasets provide a critical foundation for flood risk estimation. However, conventional information methods often struggle to supply downstream decision-making models with enough domain-specific context in limited input length.
To address this challenge, we develop an LLM agent (See
Figure 7) that leverages specialized geometric operations from the open-source geometry processing library
geopandas [
34]. This agent retrieves and summarizes property-relevant information to augment the prompt context for downstream flood risk estimation. The system is designed to be highly modular, such that adding or removing specific data sources does not disrupt the overall workflow.
With the Polygon–Text data stored as described in Section 2.2.2, for a property $p_i$, the polygon processing functions in geopandas, including GeoSeries.intersects, GeoSeries.distance, and other necessary functions, are used to compute the perpendicular distance between the property and any hydrological or drainage infrastructure element $h$ in $\mathcal{H}$. To simplify the discussion, we denote this computation by a function $d(p_i, h)$, which produces the perpendicular distance between a property $p_i$ and the geometry $g_h$ of $h$. Let $\tau$ be a threshold that controls the search distance. We then define a retrieval function $R$ accordingly:
$$R(p_i) = \{\, h \in \mathcal{H} \mid d(p_i, h) \le \tau \,\}.$$
Let ⊕ denote the text concatenation operator. Accordingly, for a property $p_i$, the retrieved information is given by the following:
$$r_i = \bigoplus_{h \in R(p_i)} t_h,$$
where $t_h$ is the textual metadata of $h$.
The context generated by the Relevant Info Agent for property $p_i$ is constructed by concatenating its coordinates, nearby hydrological and drainage infrastructure information, and its historical flood records (in practice, a fixed LLM is used to pre-process $r_i$ when constructing the context for the downstream flood risk model. “Fixed” indicates that we use an open-source LLM directly, without modifying its learned parameters. As the detailed architecture is relatively complex and not essential to the present discussion, we omit those details here; a more in-depth analysis is provided in Section 2.4.3. In our experiments, we adopt the base, unmodified versions of each model: the original LLaMA for Flood-LLaMA and the original Qwen for the Relevant Information Agent). The resulting context, denoted $x_i$, is fed into the downstream LLM for neighbor-property information construction, providing contextual information about neighboring properties.
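The threshold-based retrieval and concatenation described above can be sketched without geopandas as follows. This is a simplified stand-in: it measures planar distance from a property centroid (rather than the parcel polygon) to point or polyline features, and it sorts hits by distance and appends the distance to each text, both of which are illustrative choices. All names and values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest planar distance from point p to segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clamp to its endpoints.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def distance_to_entity(p: Point, vertices: Sequence[Point]) -> float:
    """Distance from p to a point feature or to the nearest polyline segment."""
    if len(vertices) == 1:
        return math.hypot(p[0] - vertices[0][0], p[1] - vertices[0][1])
    return min(point_segment_distance(p, a, b)
               for a, b in zip(vertices, vertices[1:]))


def retrieve_context(p: Point,
                     entities: List[Tuple[Sequence[Point], str]],
                     tau: float) -> str:
    """Concatenate the text of every entity within distance tau of p."""
    hits = [(distance_to_entity(p, verts), text) for verts, text in entities]
    nearby = sorted((d, t) for d, t in hits if d <= tau)
    return " ".join(f"{t} (distance {d:.1f} m)." for d, t in nearby)


entities = [
    ([(0.0, 0.0), (100.0, 0.0)], "Waterway: Example Creek"),
    ([(10.0, 30.0)], "Gully inlet"),
    ([(0.0, 500.0), (100.0, 500.0)], "Pipe: 600 mm concrete"),
]
context = retrieve_context((50.0, 20.0), entities, tau=50.0)
print(context)
```

The distant pipe is excluded by the threshold, which is how the search radius bounds the length of the text passed downstream.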
2.4.2. Div-Max Neighbor Info Agent
Our framework requires each property representation to encode not only its own attributes, but also the conditions of nearby parcels that shape its flood exposure. To meet this requirement, we introduce the Div-max Neighbor Info Agent, which collects geohydrological, drainage, and historical attributes from parcels surrounding a target property (see
Figure 8). The agent operates through a graph-style recursive context aggregation mechanism: each property node integrates information from its adjacent neighbors and progressively incorporates contributions from higher-order neighbors. This process yields a neighborhood representation that captures broader spatial patterns relevant to inundation risk.
Specifically, for a property
, we define a neighborhood retrieval function
, which identifies adjacent properties using the
GeoSeries.intersects function from the
geopandas package. The resulting neighborhood set
is given by the following:
Although only adjacent neighbors of a specific property are included in its neighborhood set from a local perspective, a global perspective reveals that recursive multi-hop neighborhood connections can model long-range potential relationships between any two properties through this adjacency relation (i.e., any property in the city can be a multi-hop neighbor of a given property). This constructs a large-scale graph in which nodes represent properties, and edges consist of both adjacency-based neighborhood relationships and paths along these neighborhood relations that connect any two properties within the city.
We term this urban simulation approach “graph-style recursive aggregation.” Beyond this neighborhood relation, no complex relationships were simulated. This is because the model may be applied in pre-expert scenarios, and constructing hydrologically meaningful relationships without bias or noise would be time-consuming to implement in such scenarios.
Our approach here is to provide graph-style recursive aggregation, which globally incorporates all possible paths between any two properties, and to leverage data to train the model to learn inherent patterns (e.g., upstream-downstream relationships and their roles across different flood types).
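The multi-hop neighbor idea can be made concrete with a breadth-first search over the adjacency relation. The sketch below is illustrative only: the adjacency dictionary is hypothetical toy data, and `hop_distances` is a name introduced here, not a function from the original framework.

```python
from collections import deque

def hop_distances(adjacency, source):
    """Breadth-first search: adjacency hops from source to each reachable property."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    return hops

# Hypothetical parcels: A-B-C-D form a chain, E touches no other parcel.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": [],
}
hops = hop_distances(adjacency, "A")
```

Here D is a three-hop neighbor of A even though the two parcels do not touch. The isolated parcel E does not appear in the result, which shows that a property is a multi-hop neighbor only when a chain of adjacent parcels connects it to the target.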
We then apply the context construction process defined in Equation (
17) to generate contextual descriptions for each neighboring property. However, in dense urban environments, the resulting context can become excessively long, potentially distracting the downstream model and substantially increasing token consumption.
To mitigate this issue while preserving informative content, we introduce an embedding-based filtering strategy.
We apply a variant of LLM, the parameter-fixed LLM embedding model, expressed as
to generate vector representations of each neighbor’s context (Given that the detailed architecture is relatively complex and not essential to the present discussion, we also omit the specifics of this modification here. A more in-depth introduction is provided in
Section 2.4.3. In fact,
is an LLM without the “tokenizer decoder” component and with fixed learnable parameters
(as defined in Equation (
28) of
Section 2.4.3)).
Let
denote the predefined embedding dimension. The parameter-fixed LLM embedding model is defined as a function that takes text—composed of token sequences—as input and outputs numeric vectorized representations of the context, i.e., the embeddings:
Such a vector representation is commonly known as the “embedding” of the input neighbor’s context in the LLM community.
For a neighbor
, its embedding
is computed as follows:
Specifically, these embeddings generated by the LLM exhibit a useful property: for text samples, the Euclidean distance between embeddings of semantically similar texts is typically smaller than that between embeddings of semantically dissimilar texts. We leverage this property of the embeddings to reduce the context complexity of the target property while retaining contextual information from its neighbors that possess distinct characteristics. This is achieved by merging neighbors with similar contextual features.
In this process, the first step is to identify a representational variance set
. This
consists of neighbors whose embeddings exhibit maximal variance from one another, meaning they cannot be easily characterized as “similar to another neighbor already included in the set
”:
For the remaining neighbors
, we use embedding similarity, measured via Euclidean distance, to determine which neighbor in
they are most similar to:
We then merge the long description of these neighbors
using a concise description: “
is similar to
.” Thus we construct the modified context
for each neighbor
as follows:
Finally, we define the aggregated context by the Div-max Neighbor Info Agent for the target property
as follows:
This context serves as a compressed yet information-rich representation of the neighborhood of the target property and enables the downstream model to relate the target to any other property in the city via graph-style recursive neighbor aggregation.
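The selection-and-merging procedure of this subsection can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the toy two-dimensional embeddings, and the context strings are hypothetical, and the diversity objective is assumed to be the sum of squared Euclidean distances to the subset mean, following the description in the complexity analysis.

```python
from itertools import combinations

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def spread(vectors):
    """Sum of squared distances from each vector to the subset mean."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return sum(sq_dist(v, mean) for v in vectors)

def select_diverse(embeddings, k):
    """Exhaustively pick the subset of at most k neighbors with the largest spread."""
    names = list(embeddings)
    best, best_score = (), -1.0
    for size in range(1, min(k, len(names)) + 1):
        for subset in combinations(names, size):
            score = spread([embeddings[n] for n in subset])
            if score > best_score:
                best, best_score = subset, score
    return list(best)

def compress_contexts(contexts, embeddings, k):
    """Keep full text for selected neighbors; summarize the rest by similarity."""
    kept = select_diverse(embeddings, k)
    out = []
    for name in embeddings:
        if name in kept:
            out.append(contexts[name])
        else:
            nearest = min(kept, key=lambda s: sq_dist(embeddings[name], embeddings[s]))
            out.append(f"{name} is similar to {nearest}.")
    return " ".join(out)

# Hypothetical embeddings: two tight clusters of two neighbors each.
embeddings = {"n1": (0.0, 0.0), "n2": (0.1, 0.0), "n3": (10.0, 0.0), "n4": (10.0, 0.2)}
contexts = {name: f"{name}: full description." for name in embeddings}
kept = select_diverse(embeddings, 2)
compressed = compress_contexts(contexts, embeddings, k=2)
```

With k = 2 the sketch keeps one representative from each cluster and replaces the other two descriptions with short similarity statements, which is the intended compression effect.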
2.4.3. Learnable Estimation Agent
Unlike traditional machine learning methods, large language models (LLMs) excel at integrating information from heterogeneous and complex domains. Although they differ in architectural details, most LLMs consist of two main components: fixed tokenization functions that map input text to vector representations and convert vectors back into text (i.e., the tokenizer encoder and decoder), and a neural network with learnable parameters. During inference, input text is represented and processed by the LLM as a sequence of vectors (see
Figure 9).
Let
and
denote the predefined input and output vector dimensions of the LLM, respectively. The tokenizer encoder is defined as
Similarly, the tokenizer decoder is defined as
The tokenizer encoder and decoder implement fixed mappings: the encoder converts input text into real-valued vectors that can be processed by a downstream neural network, and the decoder converts the resulting vector representations back into text (There are many different tokenizer designs. Each open-source LLM is released with its own corresponding tokenizer. As this paper does not address modifications to the tokenizer component, the detailed design of the tokenizer is omitted).
The learning capability of an LLM is determined by its neural network component, whose parameters can be adjusted to approximate desired output patterns and generate appropriate responses. Although the architectural designs of neural networks vary across different LLMs, they can all be expressed as compositions of parameterized functions with distinct learnable parameters. Specifically, these neural networks are constructed by stacking basic functional units, commonly referred to as “layers”.
To simplify the discussion, let
denote the total number of layers in the neural network, and
index a specific layer. We then denote the parameter set
with a single matrix
that aggregates all learnable parameters in the parameter set. Let
and
denote the dimension of large enough to contain all
in
and rewrite the neural network as
Finally, an LLM with learnable parameters
can be expressed as
Let
denote a prompt that guides the LLM to estimate flood risk levels based on the combined property context
, and return the output in a structured format (e.g., placing the result between the delimiters “[” and “]”). The generated textual output for property
is then given by
Although LLMs perform well on general tasks, domain-specific applications such as property-level flood risk estimation continue to require alignment with expert reasoning processes. However, data collected from human experts typically contains only final estimation results, without the intermediate reasoning or chain of thought.
To fine-tune the LLM using only these ground-truth labels, we define a mask function
that extracts the estimated risk vector from the model output using the format enforced by
:
Equation (
30) means that only outputs that satisfy the constraints will be utilized, while the others will be discarded to facilitate automatic extraction by the program (In fact, the
in Equation (
30) is converted from a string to an integer array using standard string-conversion functions that are widely available in most programming languages, as discussed below Equation (
2). This design is simplified here because such processing is common in computer algorithms and does not affect the discussion). Specifically, the constraint in Equation (
30) consists of two parts. First, only the text between “[” and “]” is extracted, ensuring that only outputs following the format enforced by
are considered. Second, the constraint that extraction occurs only when
ensures that the output contains valid predicted flood levels consistent with the format of expert predictions
, as defined in Equation (
6). Thus, we can extract the four integers from the output of
:
For each property
, we apply a variant of the norm function in Equation (
13) that is augmented to handle
None values as scores:
The value of
reflects the model’s confidence and accuracy: higher values indicate closer alignment with the expert label, while lower values suggest greater discrepancy. The corresponding variant of the optimization in (
13) for fine-tuning the flood risk estimation LLM
can then be written as follows:
Specifically, all open-source LLMs are pretrained by their providers. Let
denote the pretrained parameters released by the LLM provider. The optimization problem in (
33) can then be reformulated as learning a parameter modification
applied to the pretrained parameters
:
However, given the large number of learnable parameters,
can be prohibitively large. As a result, directly optimizing (
34) requires substantial GPU memory and training time (Even for the “smaller” 3B-scale LLMs used in this study, directly optimizing (
34) may require more than 32 GB of GPU memory). To mitigate this computational burden, we fine-tune the LLM using the Low-Rank Adaptation (LoRA) approach [
35]. Under this approach, the update matrix
is parameterized as the product of two low-rank matrices,
and
, where
. Combining Equations (
32)–(
34), the Learnable Estimation Agent utilizing this approach can be denoted by
In this approach, the dense update matrix is approximated by the product of two much smaller matrices, A and B, which effectively reduces memory consumption to approximately one quarter of that required by the original approach.
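To give a sense of scale for the low-rank factorization, the sketch below counts the trainable parameters of a dense update and of its rank-64 factorization for a single weight matrix. The layer size of 2048 × 2048 is a hypothetical value chosen for illustration, and the function names are introduced here. The count covers only the update parameters of one matrix and does not by itself predict total GPU memory use.

```python
def lora_update_params(d_out, d_in, rank):
    """Trainable parameters of a rank-`rank` factorization of a d_out x d_in update."""
    return rank * (d_out + d_in)

def matmul(left, right):
    """Plain-Python matrix product, adequate for tiny illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)] for row in left]

d_out, d_in, rank = 2048, 2048, 64   # hypothetical layer size; rank as in the experiments
full = d_out * d_in                   # parameters of the dense update
low_rank = lora_update_params(d_out, d_in, rank)
ratio = low_rank / full

# When one factor starts at zero, the initial update is zero, so fine-tuning
# begins exactly from the pretrained weights.
A = [[0.3, -0.1], [0.2, 0.5], [-0.4, 0.1]]   # 3 x 2, random-like values
B = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]       # 2 x 3, all zeros
delta = matmul(A, B)
```

For this hypothetical layer, the factorized update has 262,144 parameters against 4,194,304 for the dense update, a ratio of 1/16.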
Our Learnable Estimation Agent learns the final learnable parameters using this LoRA-based approach:
The resulting parameters can be directly applied as the LLM parameters, enabling the LLM to predict flood levels by following the patterns learned from the training samples.
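The mask function of Equation (30) and the None-aware score can be illustrated with a short parsing routine. This is a sketch under assumptions: `extract_levels` and `hypothetical_score` are names introduced here, taking the last bracketed span of the output is an assumed tie-breaking rule, and the negative L1 score with a fixed penalty for invalid outputs is a stand-in rather than the exact form of the score defined in Equation (32).

```python
import re

NUM_FLOOD_TYPES = 4  # one risk level per flood type

def extract_levels(output_text):
    """Return four risk levels (0-3) from the last '[...]' span, or None if invalid."""
    spans = re.findall(r"\[([^\[\]]*)\]", output_text)
    if not spans:
        return None
    tokens = [t.strip() for t in spans[-1].split(",")]
    if len(tokens) != NUM_FLOOD_TYPES:
        return None
    if not all(re.fullmatch(r"[0-3]", t) for t in tokens):
        return None
    return [int(t) for t in tokens]

def hypothetical_score(predicted, expert, invalid_score=-12):
    """Stand-in score: higher is better; invalid outputs get the worst value."""
    if predicted is None:
        return invalid_score
    return -sum(abs(p - e) for p, e in zip(predicted, expert))

valid = extract_levels("River flood (3): ... Final Output: [0, 2, 0, 1]")
missing = extract_levels("The property is at high risk.")
out_of_range = extract_levels("Final Output: [0, 5, 0, 1]")
```

Outputs without a well-formed bracketed list, or with values outside Levels 0–3, yield None and therefore receive the penalty score rather than being parsed incorrectly.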
2.4.4. Overall Flood-LLM Framework
Combining the components described above, we propose the Flood-LLM framework to estimate the flood risk level
for any property
, i.e., properties for which the flood levels are unknown:
Equation (
37) provides the implementation to generate
for
, i.e., the solution for the formulated task in Equation (
14),
Section 2.3. The workflow is summarized in Algorithm 1 and
Figure 1.
| Algorithm 1 Learning procedure of Flood-LLM |
Input: training set, pretrained LLM
Parameters: number of training epochs T, prompt
Output: fine-tuned LoRA parameters A and B
1: Randomly initialize A.
2: Initialize B.
3: for each neighbor do
4:   Generate according to Equation (16).
5:   Generate and as Equation (21).
6: end for
7: for each epoch up to T do
8:   for each sample do
9:     Generate according to Equation (17).
10:     if then
11:       Generate modified context according to Equation (24).
12:     else
13:       continue
14:     end if
15:     Compute score as in Equation (32).
16:   end for
17:   Optimize LoRA parameters by maximizing as in Equation (35).
18: end for
19: save A and B
20: return A and B
Computational Complexity Analysis
In the training process, Step 5 of Algorithm 1 involves the repeated optimization of Equation (
21). A concern is whether this optimization is excessively time-consuming. We analyze the computational complexity of solving the optimization problem in Equation (
21), which aims to select the optimal neighbor subset for each property
. First, we clarify the values of key parameters involved in the complexity calculation: the dimension of the property representation vector
H,
, is set to 1024 in accordance with the suggestions of LLM providers (e.g., Qwen-3 and LLaMA-3.2); the average value of
in the Brisbane; the size of the output optimal neighbor subset,
k, is determined as 8 via parameter analysis (detailed in
Section 3.5); the total number of properties in Brisbane, as described above, is
= 519,009. The optimization problem in Equation (
21) is a subset selection task that maximizes the variance of neighbor representation vectors under the constraint
. We decompose its complexity for a single property
and then extend this analysis to the entire dataset.
For a single property , the core computation consists of two parts: calculating the variance for a candidate subset , and traversing valid candidate subsets to identify the optimal one. For the variance calculation of , for a subset with size t (), we first compute the mean representation of the t vectors (with a computational cost of ) and then calculate the squared Euclidean norm between each vector and this mean (with an additional cost of ). The total cost for one subset is , and since , this cost is bounded by . To traverse the candidate subsets and find , we need to traverse all valid subsets of with size . Given and , the computational complexity of this step is approximately 256, which is a relatively small constant (denoted as ) that is independent of the total number of properties .
The overall complexity for the entire dataset can be derived by combining the two parts above. The computational cost for a single property
is
. Since
,
k, and
are all fixed parameters, we execute the above single-property computation independently for each
in the entire dataset with
properties. Given that the computational cost for each property
(where 519,009 is approximately 61 times the value of
), the overall complexity of solving Equation (
21) is:
Since the overall complexity is linear with respect to
(i.e.,
), the computational overhead of solving Equation (
21) is negligible in the entire pipeline and fully acceptable for large-scale datasets such as the Brisbane urban planning dataset.
Specifically, if we utilize this algorithm in cities with complex neighborhood relations, we can reduce this complexity by directly applying for entities with , as these entities already possess small neighbor sets.
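The size of the exhaustive search discussed above can be checked with a few lines of arithmetic. The sketch below is illustrative only: a neighbor-set size of 8 is a hypothetical value chosen because it yields a subset count close to the constant of 256 quoted in the analysis, and the per-subset cost of roughly 2·t·H operations (one pass for the mean, one for the squared distances) is an assumed reading of the cost description.

```python
from math import comb

def num_candidate_subsets(n_neighbors, k):
    """Number of non-empty subsets of size at most k drawn from n_neighbors."""
    return sum(comb(n_neighbors, t) for t in range(1, min(k, n_neighbors) + 1))

def per_property_cost(n_neighbors, k, dim):
    """Rough operation count, assuming a subset of size t costs about 2 * t * dim."""
    return sum(
        comb(n_neighbors, t) * 2 * t * dim
        for t in range(1, min(k, n_neighbors) + 1)
    )

subsets_small = num_candidate_subsets(8, 8)       # hypothetical neighbor count of 8
cost_small = per_property_cost(8, 8, 1024)        # embedding dimension 1024
subsets_dense = num_candidate_subsets(20, 8)      # a denser hypothetical neighborhood
```

The count stays near 2^8 for eight neighbors but grows quickly for larger neighbor sets, which is why the per-property cost can be treated as a constant only when neighborhoods remain small.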
3. Results
3.1. Experimental Settings
Following the standard evaluation framework for classification [
9,
36], we employ the following metrics to assess predictive performance: the confusion matrix and accuracy.
Confusion matrix for multi-class classification (
Table 2). A confusion matrix for multi-class classification is used to present the classification results by comparing the predicted labels against the actual labels. For a classification problem with
classes, the matrix is of size
, where each row corresponds to the predicted class and each column corresponds to the actual class. The diagonal elements represent correct predictions, while off-diagonal elements indicate misclassifications.
In this matrix, each entry $M_{ij}$ denotes the number of samples that belong to actual class j but were predicted as class i. For a specific class c:
- True Positives for class c ($TP_c$): the number of samples correctly predicted as class c, i.e., $TP_c = M_{cc}$.
- False Positives for class c ($FP_c$): the number of samples from other classes that were incorrectly predicted as class c, i.e., $FP_c = \sum_{j \neq c} M_{cj}$.
- False Negatives for class c ($FN_c$): the number of samples from class c that were incorrectly predicted as other classes, i.e., $FN_c = \sum_{i \neq c} M_{ic}$.
- True Negatives for class c ($TN_c$): the number of samples correctly predicted as not class c, i.e., $TN_c = \sum_{i \neq c} \sum_{j \neq c} M_{ij}$, the sum of all entries excluding row c and column c.
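The four per-class quantities listed above can be computed directly from the confusion matrix. The following sketch uses the same orientation as the text (rows are predicted classes, columns are actual classes); the function names and the toy label lists are hypothetical, and overall accuracy is taken in its standard form as the share of diagonal entries.

```python
def confusion_matrix(actual, predicted, num_classes):
    """Rows index the predicted class, columns index the actual class."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        m[p][a] += 1
    return m

def per_class_counts(m, c):
    """Return (TP, FP, FN, TN) for class c."""
    n = len(m)
    tp = m[c][c]
    fp = sum(m[c][j] for j in range(n) if j != c)   # row c, off-diagonal
    fn = sum(m[i][c] for i in range(n) if i != c)   # column c, off-diagonal
    total = sum(sum(row) for row in m)
    tn = total - tp - fp - fn                        # everything outside row c and column c
    return tp, fp, fn, tn

def accuracy(m):
    """Share of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

# Hypothetical labels for eight properties with risk Levels 0-3.
actual = [0, 0, 1, 1, 2, 3, 3, 3]
predicted = [0, 1, 1, 1, 2, 3, 2, 3]
m = confusion_matrix(actual, predicted, 4)
level3 = per_class_counts(m, 3)
overall = accuracy(m)
```

In this toy example, one Level 3 property is predicted as Level 2, so Level 3 has one false negative and no false positives.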
Accuracy. This metric judges the global alignment of the results [
36] and is calculated as follows:
Level Accuracy (L.Acc). Specifically, to evaluate the model’s performance across different flood risk levels (Level 0–3), we compute the accuracy for each class
c:
For the compared approaches, drawing on methods that have been widely adopted and empirically validated in the flood prediction literature, we select a set of representative machine learning and deep learning models for comparative evaluation. Support Vector Machines (SVM) and Random Forest (RF) are traditional machine learning methods that have been shown to effectively capture flood data patterns [
9,
26]. SVM works by finding the optimal hyperplane to separate data classes, while RF improves prediction accuracy through ensemble decision trees. Multilayer Perceptron (MLP) and Graph Convolutional Network (GCN) are among the top-performing deep learning methods for flood modeling [
25]. MLP is a classic deep learning model with fully connected layers, whereas GCN leverages graph convolutions to incorporate neighboring information, enhancing its effectiveness.
Additionally, we employed open-source large language models (LLMs) without Flood-LLM fine-tuning as baseline comparisons, using LLaMA3.2-1B-Instruct [
37] and Qwen3-1.7B-Instruct [
38]. These models were chosen for their strong performance and widespread adoption while maintaining manageable computational cost. Given the shared transformer-based architecture of modern LLMs, experiments on these models are sufficient to demonstrate the compatibility of our framework with this class of models. Both models take inputs from the Relevant Info Agent and Div-max Neighbor Info Agent to assess their performance in flood risk prediction.
We conduct the experiments on a high-performance system equipped with two NVIDIA A800 GPUs for efficient training and inference. The system is powered by a 32-core CPU and is equipped with 80 GB of RAM, ensuring smooth processing and fast computation for large-scale models and datasets.
The threshold
in Equation (
15) is set to 100 m to limit the search scope, consistent with commonly adopted settings in prior studies [
39,
40,
41]. The rank parameter
in Equation (
35) is set to 64. Matrix
A is initialized using a Gaussian distribution, while matrix
B is initialized with all-zero values. This initialization strategy follows the recommendations of the original LoRA paper [
35] and is adopted to ensure stable and efficient optimization. For all methods, we employ the Adam optimizer [
42], which is widely used and well established in the deep learning literature. For the base LLMs, we adopted the identical prompt as that used for our Flood-LLM, as these models consistently failed to generate valid final outputs without this prompt. Given that this study employs relatively small-scale LLMs, which are inherently limited by poor generalization capability, we prioritized fine-tuning and did not conduct additional prompt engineering experiments on the base models. All remaining hyperparameters and experimental settings strictly follow those reported in the respective original papers.
3.2. Overall Performance Comparison Across Models
Table 3 presents the overall accuracy of various models across four flood types.
Classical ML methods (SVM and Random Forest) and the MLP operate on vector-based property-level inputs, including property coordinates, elevation, and boolean-encoded flood history. Their performance is limited as these models rely solely on tabular representations and cannot explicitly capture spatial or relational dependencies, which are critical for flood risk estimation.
The GCN extends this representation by taking a graph-structured input, where each node corresponds to a property encoded by the same vector features (coordinates, elevation, and historical indicators), and edges model neighborhood relationships. This enables the incorporation of spatial context and leads to improved performance over the MLP. However, GCNs remain constrained by over-smoothing during multi-hop aggregation, which hampers their ability to represent complex and heterogeneous urban spatial patterns.
All LLM-based approaches use the same text-based property description as their explicit input. For each property, this description encodes property coordinates, elevation, and boolean flood history in natural language form. In addition, the Relevant Info Agent integrates heterogeneous geospatial, hydrological, and infrastructural data into coherent natural language inputs. The Div-max Neighbor Info Agent enhances spatial reasoning by selecting semantically diverse and representative neighbors, preserving essential contextual information. This context supports the model’s chain-of-thought reasoning and final prediction.
Interestingly, despite this enriched contextual access, general-purpose LLMs (LLaMA3.2 and Qwen3) remain untrained on flood-risk data and consequently exhibit very poor performance. Detailed statistics on the valid output ratio and the conditional accuracy given valid outputs for base LLM approaches are reported in
Table 4. These results reveal two critical limitations of directly applying base LLMs to the flood risk estimation task:
(1) The results indicate that, without fine-tuning, base 1B-scale LLMs struggle to produce valid outputs. The relatively low valid output ratio suggests that although base LLMs are capable of ingesting and processing complex textual inputs, they often fail to formulate outputs that conform to the required format or constraints. Even when the model internally arrives at a correct estimation, invalid outputs prevent automatic parsing and downstream utilization, thereby necessitating manual intervention. Such reliance on human post-processing is inefficient and incompatible with pre-expert flood risk estimation scenarios, where low human effort is a key requirement.
(2) The conditional accuracy given valid outputs reflects the correctness of predictions restricted to outputs that are syntactically and structurally valid. The results show that, even when base LLMs successfully generate valid outputs, their estimation accuracy remains limited. This deficiency primarily arises from the lack of domain-specific adaptation to flood risk patterns. Flood risk estimation is a highly specialized task that requires expert judgment to interpret spatial, environmental, and infrastructural information in relation to flood hazards. Such judgment is typically acquired through professional practice and accumulated empirical experience, rather than through the general linguistic and commonsense knowledge captured by large-scale pretraining corpora. Consequently, without domain-specific supervision, general-purpose LLMs lack the inductive bias necessary to align their reasoning with expert-informed flood risk assessment practices, leading to unreliable predictions.
In contrast, the proposed Flood-LLM framework achieves a notable performance improvement. The Learnable Estimation Agent fine-tunes the core LLM using limited expert-labeled data via the LoRA algorithm, enabling it to approximate expert decision-making while maintaining interpretability. This training process effectively encodes expert knowledge into the model, allowing it to internalize which patterns and attributes are considered relevant for flood risk estimation. As a result, Flood-LLM approximates expert decision-making in a scalable and interpretable manner, leading to consistently superior performance across all flood types.
3.3. Disaggregated Performance by Flood Presence and Severity
We visualize the flood risk estimation outputs of Flood-LLM using both the
LLaMA3.2-1B-Instruct and
Qwen3-1.7B-Instruct, and compare them with the official flood risk map published by the Brisbane City Council (see
Figure 10). More detailed quantitative results, including the Affected Property Area (A.P.A) and the per-level accuracy (L.Acc), are provided in
Table 5 (This study uses the lot area attribute from the property parcel dataset and aggregates it to quantify the Affected Property Area exposed to different flood types and risk levels). To further characterize model performance, we additionally examine the directional error structure using confusion matrices for each flood type (see
Figure 11).
Taken together, the visual and quantitative results indicate that Flood-LLM achieves high reliability in identifying flood presence, while exhibiting a consistent tendency to compress predictions toward intermediate classes when differentiating flood risk levels.
In terms of binary flood presence, the predicted maps align closely with the official Council assessments, achieving high accuracy in distinguishing flooded from non-flooded areas across all flood types. This indicates that Flood-LLM effectively captures the primary spatial extent of flood exposure, even in complex urban settings. The robustness of this binary performance further suggests that the integration of heterogeneous contextual information and spatial neighborhood reasoning enables reliable identification of flood-affected areas at the city scale.
With respect to the distribution of flood risk levels, a systematic bias is observed across all flood types: lower risk levels (Levels 0–1) tend to be overestimated, whereas the highest risk level (Level 3) tends to be underestimated relative to the ground truth. This pattern indicates that both Flood-LLaMA and Flood-Qwen adopt a conservative severity allocation, inflating lower-risk cases while attenuating high-severity inundation and redistributing parcels toward intermediate risk tiers. As a result, the predicted flood extents exhibit a compressed risk spectrum that favors intermediate severity and reduces the spatial footprint of extreme flooding. By comparison, while the LLaMA-based model produces a level-wise property area distribution that appears slightly closer to the Council maps, it exhibits marginally lower accuracy than Qwen, particularly in distinguishing adjacent flood risk levels. This difference can be attributed to the inherent limitations of current large language models in mathematical reasoning and quantitative calibration, despite their strong capabilities in qualitative pattern recognition. Such behavior is consistent with well-documented weaknesses in mathematical and coding tasks.
3.4. Ablation Study
Our ablation study investigates whether the multi-domain contextual information integrated by the Relevant Information Agent significantly impacts flood risk estimation performance. As shown in
Table 6, we observe three key findings:
First, the models successfully capture major flood threat patterns, with particularly strong performance on creek (C. Flood: 88.80–90.40%), river (R. Flood: 88.55–90.75%), and storm-tide floods (S. Flood: 91.75–95.45%). The performance degradation when removing specific contexts confirms their importance: hydrological removal reduces accuracy by 1.9–5.1% for waterway-related floods, while infrastructural removal causes the most significant drop for overland-flow floods (O. Flood: 6.7–7.5% decrease).
Second, the models demonstrate robust fallback capabilities, maintaining reasonable accuracy (all >73%) even when critical contexts are excluded. This suggests effective information redundancy, where remaining contexts can partially compensate for missing domains (e.g., historical data helping mitigate geospatial removal impacts).
Third, the complete Flood-LLM configuration consistently achieves optimal results across all scenarios, validating our multi-context integration approach. The performance advantage is most pronounced for complex flood types like overland-flow, where full-context Qwen3 outperforms its ablated versions by 6.5–14.0%.
3.5. Parameter Analysis
To examine the effect of the Div-max Neighbor Info Agent, we conduct a parameter analysis on the neighborhood size
k in Equation (
21) using the overland-flow flood dataset.
As described in Equation (
23), smaller values of
k provide less detailed neighborhood context, as most neighboring properties are summarized in the form “{this neighbor} is similar to {an already included neighbor}.” In particular,
corresponds to the absence of neighbor information, while
is excessively large for most properties, as it incorporates nearly all neighbors with minimal summarization.
The results are presented in
Figure 12, from which three key observations can be drawn. First, incorporating neighborhood information consistently improves the performance of Flood-LLM. When neighbor information is removed, the accuracy of Flood-LLaMA and Flood-Qwen decreases by
and
, respectively, highlighting the importance of neighborhood context in flood risk estimation.
Second, when k is too small, performance remains limited. This is likely because excessive summarization leads to substantial information loss, which adversely affects flood risk estimation.
Third, increasing k beyond an optimal range does not yield further performance gains. This suggests that essential neighborhood information is already captured, and adding additional neighbors may instead introduce noise or distract the LLM.
Based on these findings, we set k = 8 in our implementation, as it achieves the best overall performance in this experiment and is therefore adopted for comparison with other settings.
3.6. Analysis of LLM Reasoning Processes
We also conduct several case studies to analyze the prediction workflow of the LLM. In
Table 7 and
Box 1, we present the facts and chain-of-thought for a specific property generated by the Qwen model after supervised fine-tuning (SFT). In this example, although the model has not accessed any human-reasoning process, it systematically and correctly identifies the key factors needed to estimate the risk associated with each type of flood and derives the final result accordingly. Such chain-of-thought reasoning helps domain experts understand the LLM’s decision-making workflow, making the model’s behavior safer and more transparent.
Box 1. Chain-of-thought interpretation by Flood-Qwen3 for property 13 * 46
Here’s the reasoning behind the flood risk assessment for each type:
Creek flood (0): No open waterways or gullies within 100 m of the target property. Neighboring properties with gullies have proper drainage pipes, reducing overflow risk.
River flood (3): The property was flooded in 1974, 2011, and 2022, indicating a recurring high-risk pattern. Multiple neighboring properties at the same elevation (7.13 m AHD) also experienced these floods. The elevation (7.13 m) is likely within a major river’s floodplain.
Storm tide flood (0): The property is ∼20.4 km from the coastline, far enough to avoid storm surges. No historical records suggest storm tide flooding.
Overflow flood (2): No major gullies or drainage issues directly near the property. However, past floods (1974, 2011, 2022) suggest drainage systems may be overwhelmed during extreme rainfall. Some neighbors have gullies with drainage pipes, but the target property itself lacks direct drainage infrastructure.
Final Output: [0, 2, 0, 1]
4. Discussion
4.1. Multi-Domain Integration and the Complexity of Urban Flood Risk
The ablation experiments demonstrate that combining geographical, hydrological, drainage-infrastructure, and historical flood information substantially enhances predictive performance. In each variant where one data domain is removed, the model exhibits a notable reduction in accuracy, indicating that urban flood risk emerges from the interaction of multiple physical and infrastructural factors rather than from single-variable determinants. This finding reflects the real spatial complexity of urban environments, in which elevation, runoff pathways, pipe capacity, soil permeability, and past flood behaviors influence one another. By converting heterogeneous inputs into a unified textual representation, Flood-LLM can associate diverse information sources in a manner that resembles aspects of expert flood assessment, suggesting a possible way to address data-integration challenges in flood risk analysis.
4.2. Neighborhood Context and the Identification of Localized Vulnerabilities
The parameter analysis further shows that the incorporation of spatially diverse neighboring properties significantly improves the model’s ability to detect localized flood risks. Urban flooding often depends on micro-topographic variations, subtle drainage connections, and small-scale runoff patterns that are not adequately captured when using only the target property’s attributes. The improvements introduced by the Div-max Neighbor Info Agent indicate that effective flood prediction requires not only spatial proximity but also contextual diversity. This capability reflects how hydrological practitioners interpret flood behavior: properties that appear similar in elevation or land use may experience different hazards depending on upstream flow paths, pipe networks, or historical recurrence patterns. The findings therefore highlight the importance of contextual reasoning in enhancing the accuracy of risk assessments in heterogeneous urban settings.
4.3. Prospects and Reflections on the Framework’s Methodological Potential
This study shows that Flood-LLM achieves effective learning from heterogeneous urban inputs and produces observable reasoning patterns during prediction that align with expert-derived flood risk labels. These findings point to two prospective methodological strengths of the framework. First, the training paradigm, which associates diverse urban indicators with planning labels, suggests the possibility of adapting the approach to different planning contexts through retraining on locally defined labeling schemes. Second, the use of transparent reasoning chains may support interpretation and oversight by making visible how the model connects terrain conditions, infrastructure indicators, waterway proximity, and historical flood information when forming predictions. In any potential future application context where model outputs might diverge from expert judgment, these reasoning traces could provide a basis for examining which spatial cues were emphasized, potentially assisting experts in identifying whether discrepancies arise from data limitations, contextual ambiguity, or model bias rather than opaque computational processes.
However, it is important to emphasize that these considerations remain exploratory and relate to possible future developments rather than present-day applications. The findings demonstrate the framework’s methodological potential, while any practical relevance would depend on further technical refinement together with careful development of appropriate governance, accountability, and ethical frameworks.
4.4. Opportunities and Limitations of LLM-Based Flood Risk Models
Despite the model’s overall performance advantage over RF, MLP, and GCN baselines, the confusion matrix reveals misclassifications in intermediate risk categories. Flood-LLM is more reliable in distinguishing flood presence versus absence than in differentiating fine-grained risk levels, particularly for properties whose outcomes depend on sensitive hydrodynamic behaviors or infrastructure thresholds not fully captured in text. This limitation reflects a broader challenge in applying LLMs to flood prediction: models excel at synthesizing multi-domain qualitative information but may lack the numerical precision required for borderline distinctions. A promising direction for future work is therefore to decouple the prediction task by first focusing on flooded-area identification, followed by targeted optimization for risk level classification within flooded regions.
The confusion matrix also suggests a tendency toward relatively conservative classifications, with predictions shifting away from extreme categories toward intermediate ones. This pattern reflects a degree of imprecision in how the model currently associates spatial cues with planning labels. If applied in practical contexts, such imprecision could influence outcomes, so future work would need to focus on calibration strategies and threshold design to further refine the model’s behavior.
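Both observations can be quantified directly from a confusion matrix. The sketch below is one possible way to do so, assuming ordered classes with index 0 denoting no flood risk; the function names and the toy matrix are illustrative assumptions, and the counts are invented rather than taken from the results of this study.

```python
def presence_and_level_accuracy(cm, dry_index=0):
    """Return (presence/absence accuracy, exact risk-level accuracy).

    cm[i][j] is the number of parcels with true class i predicted as class j,
    with classes ordered from lowest to highest risk.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    exact = sum(cm[i][i] for i in range(n))
    # Presence is correct when true and predicted agree on dry versus not dry.
    presence = sum(
        cm[i][j]
        for i in range(n)
        for j in range(n)
        if (i == dry_index) == (j == dry_index)
    )
    return presence / total, exact / total


def shift_toward_centre(cm):
    """Mean reduction in distance from the middle class (true minus predicted).

    Positive values mean predictions sit closer to the middle of the ordered
    scale than the labels do, i.e. extremes are being pulled inward.
    """
    n = len(cm)
    mid = (n - 1) / 2
    total = sum(sum(row) for row in cm)
    moved = sum(
        cm[i][j] * (abs(i - mid) - abs(j - mid))
        for i in range(n)
        for j in range(n)
    )
    return moved / total


# Invented counts for four ordered classes: none, low, medium, high.
toy_cm = [
    [80, 15, 5, 0],
    [5, 30, 15, 0],
    [0, 10, 25, 5],
    [0, 0, 6, 4],
]
presence_acc, level_acc = presence_and_level_accuracy(toy_cm)
centre_shift = shift_toward_centre(toy_cm)
```

A gap between the two accuracies indicates that errors concentrate in distinguishing risk levels rather than in detecting flooding, and a positive centre shift gives a single number that calibration or threshold adjustments could aim to reduce.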
Several methodological limitations identified in this study point to opportunities for further refinement:
1. The flood risk labels adopted for model training suffer from inherent limitations arising from heterogeneous sources and asynchronous update cycles. Overfitting to such labels may lead the model to learn spurious spatial dependencies, implying that future research could benefit from labeling sources with higher internal consistency and synchronous updating.
2. The merging of low and very-low risk categories reduces label granularity, suggesting that future work may benefit from retaining finer label distinctions to improve sensitivity to subtle spatial variations in flood exposure.
3. Although training and testing labels were spatially separated, adjacent areas still share a continuous spatial context observable to the model, suggesting that future studies may explore more spatially robust evaluation strategies to better isolate potential spatial dependence effects.
4. The use of a single fixed buffer distance to characterize drainage proximity may overlook variations in how drainage configurations influence parcel-level flood exposure, suggesting that future work may benefit from exploring multiple or adaptive spatial extents to more accurately capture how drainage configurations relate to parcel-level risk patterns.
5. To limit computational cost, we used relatively small-scale LLMs with restricted generalization capacity, focusing on fine-tuning rather than prompt engineering. Future work with larger or more capable models may explore prompt-based improvements beyond the scope of this study.
6. Despite promising performance, LLMs remain inferior to human experts in reasoning, error avoidance, complex knowledge use, and accountability. This study presents a research direction rather than a mature solution. Flood-LLM may still generate erroneous inferences, such as spurious spatial dependencies, and requires further refinement before it can approach practical use.
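One spatially more robust evaluation strategy of the kind mentioned in the list above is a block holdout with an exclusion buffer: parcels are assigned to grid blocks, whole blocks are held out for testing, and training parcels lying close to any test parcel are discarded. The sketch below shows the idea under simple assumptions (planar coordinates in metres, a square grid); the function names, parameters, and example parcels are hypothetical and do not describe the split used in this study.

```python
import math


def block_id(xy, block_size):
    """Grid cell index of a point for a square grid of the given size."""
    return (math.floor(xy[0] / block_size), math.floor(xy[1] / block_size))


def spatial_block_split(parcels, block_size, test_blocks, buffer_dist):
    """Split parcels into train and test sets by grid block.

    Training parcels within buffer_dist of any test parcel are dropped so
    that the two sets do not share immediate spatial context.
    """
    test = [p for p in parcels if block_id(p["xy"], block_size) in test_blocks]
    others = [p for p in parcels if block_id(p["xy"], block_size) not in test_blocks]
    # Brute-force distance check; adequate for a sketch, slow for large cities.
    train = [
        p
        for p in others
        if all(math.dist(p["xy"], t["xy"]) > buffer_dist for t in test)
    ]
    return train, test


# Made-up parcels along a line, coordinates in metres.
parcels = [
    {"id": "a", "xy": (10.0, 10.0)},
    {"id": "b", "xy": (90.0, 10.0)},
    {"id": "c", "xy": (105.0, 10.0)},
    {"id": "d", "xy": (250.0, 10.0)},
]
train, test = spatial_block_split(
    parcels, block_size=100.0, test_blocks={(1, 0)}, buffer_dist=30.0
)
train_ids = [p["id"] for p in train]
test_ids = [p["id"] for p in test]
```

Here parcel b falls in a training block but lies 15 m from test parcel c, so it is excluded. Varying `buffer_dist` and comparing test accuracy would give a rough indication of how much performance depends on shared neighborhood context.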
In addition, several directions for future development emerge from this work. Integrating additional multi-modal data sources, such as remote sensing imagery, rainfall radar observations, and outputs from physics-based hydrodynamic simulations, may enhance the model’s sensitivity to fine-scale physical processes. Extending the framework to incorporate temporal dynamics could enable analyses under evolving climatic or infrastructural conditions. Further evaluation is also needed to assess the robustness of the approach in cities with sparse drainage data or highly irregular terrain. Incorporating basic hydrologically relevant datasets, such as drainage network hydraulics, may also help the model better capture flow-related spatial patterns and reduce systematic biases in level estimation.
5. Conclusions
This study presents Flood-LLM, a multi-agent large language model framework for exploring how heterogeneous urban data can be translated into parcel-level flood risk estimations through structured narrative reasoning. The Relevant Info Agent organizes parcel-level geospatial, elevation, drainage, waterway, and historical flood information into structured descriptions. The Div-Max Neighbor Info Agent extends this representation by identifying relevant neighboring parcels and constructing a broader neighborhood-scale spatial narrative. The Learnable Estimation Agent, supported by LoRA-based fine-tuning, then relates the combined parcel- and neighborhood-level narratives to ordered risk categories in an interpretable manner. When applied to Brisbane, the results show that the framework can approximate expert-derived spatial risk patterns while making visible how different spatial cues contribute to predictions.
Future research may extend the framework by incorporating additional multi-modal and temporally dynamic data, such as remote sensing imagery, rainfall radar observations, hydrodynamic simulation outputs, and drainage hydraulics, and by examining its behavior in cities with different data availability and terrain conditions.
In conclusion, this research suggests that large language models may offer a way to interpret complex urban spatial information for flood risk estimation, particularly in contexts where detailed hydraulic modeling is not readily available. Flood-LLM illustrates how such AI-based approaches can be structured around transparent reasoning and heterogeneous spatial data. The framework still requires further refinement and optimization, and any potential future application would need to give careful consideration to transparency, accountability, and ethical governance.