A Novel Multimodal Large Language Model-Based Approach for Urban Flood Detection Using Open-Access Closed Circuit Television in Bandung, Indonesia
Round 1
Reviewer 1 Report (New Reviewer)
Comments and Suggestions for Authors

In the presented manuscript, the authors performed a study on urban flood detection by applying a large language model to CCTV images. The topic is quite interesting, as CCTV images are available in most urban areas. The idea of using this type of data for flood detection is valuable. I have the following suggestions for the authors:
- The introduction section is well organized. However, it is not clear whether the LLMs were used for fluvial floods recently or not. Is this research unique in using LLMs and CCTV for Urban flood modeling? Please specify better at the end of the introduction section.
- Figure 2 is not clear. I understand showing multiple objects such as CCTV in one map is not easy. However, a higher-resolution map can be added to show the boundaries of the city. It will be helpful to understand the density of CCTVs.
- What do you mean by “Manual Annotation”? Did you classify all images one by one?
- Did you consider performing a “fine-tuning” for the best LLM model for a specific region? How do you think it will affect the success of the models? I believe a fine-tuned model for a highly vulnerable area will increase the reliability of the detections.
- About the results, I can understand that you are trying to generate affected-area maps by using “flood” or “no flood” information. But in the conclusion, it is not clear why it is important for us, or how it will affect the end user. I believe that if we could derive early warning signals from this type of detection, it would be much more meaningful.
Author Response
Reviewer 1
Comment_1: The introduction section is well organized. However, it is not clear whether the LLMs were used for fluvial floods recently or not. Is this research unique in using LLMs and CCTV for Urban flood modeling? Please specify better at the end of the introduction section.
Response_1: Thank you for this valuable comment. We agree that clarification is needed. Based on our literature review, the use of sensor technology and CCTV-based image detection is already well established for fluvial floods, since monitoring is performed directly on water bodies (e.g., rivers) where conditions are relatively static and more suitable for conventional image detection methods. In contrast, for pluvial floods—which are typically caused by urban drainage overflow—the application of CCTV and image-based flood detection remains very limited, mainly due to the dynamic and complex urban environment (e.g., human activities, vehicles, obstacles). Furthermore, we did not find prior studies that employed LLMs for either fluvial or pluvial flood detection. It is also important to clarify that the purpose of this study is not urban flood modeling, but rather urban flood monitoring. Instead of using conventional image detection approaches, which require extensive training datasets that are often unavailable for urban areas, we explore the potential of LLMs as a lightweight alternative for image-based flood monitoring. Thank you very much for your comments; we have modified and added the content in lines 69 to 75 and lines 101 to 111 as follows:
“Although the use of CCTV in flood-related applications has been widely explored, it has mostly focused on fluvial flood monitoring. In contrast, the utilization of CCTV in urban areas to monitor pluvial floods remains limited and underdeveloped. Deploying physical sensors in urban environments is often impractical due to infrastructure complexity and vulnerability to vandalism. Thus, the potential of CCTV as a passive, non-intrusive data source for detecting urban pluvial flooding warrants further exploration” and “The innovation of this study is to develop a flood monitoring framework by integrating open-access CCTV data with the visual recognition capabilities of LLMs. Unlike traditional flood monitoring systems, which require substantial investments in physical infrastructure and trained personnel, the proposed method offers a cost-effective and scalable solution for detecting flood events, supporting real-time surveillance and early warning systems. To our knowledge, no prior studies have employed LLMs for detecting fluvial or pluvial flooding. Meanwhile, instead of traditional image-based methods that depend on large, often unavailable training datasets, we investigate the use of LLMs as a lightweight and scalable alternative for flood detection from imagery. This approach is particularly valuable in regions where conventional sensor-based data are limited or unavailable.”
Comment_2: Figure 2 is not clear. I understand showing multiple objects such as CCTV in one map is not easy. However, a higher resolution map can be added to show the boundaries of the city. It will be helpful to understand the density of CCTVs.
Response_2: Thank you for this valuable suggestion. Figure 2 in the original manuscript was a screenshot from the official website. We agree with the reviewer’s comment and have replaced it with a higher-resolution map that we created ourselves, including the city boundaries and clearer visualization of CCTV density.
Comment_3: What do you mean by “Manual Annotation”? Did you classify all images one by one?
Response_3: Thank you for this valuable question. Yes, that is correct. We manually classified all collected images into four categories: flood-day, no flood-day, flood-night, and no flood-night. This manual annotation process ensured that the dataset used for testing was consistently labeled and reliable.
Comment_4: Did you consider performing a “fine-tuning” for the best LLM model for a specific region? How do you think it will affect the success of the models? I believe a fine-tuned model for a highly vulnerable area will increase the reliability of the detections.
Response_4: Thank you for this valuable comment. Yes, we agree that fine-tuning a large language model (LLM) for a specific region could enhance the reliability of detections. In the present study we did not perform fine-tuning, as region-specific datasets such as historical flood reports or hydrological records are limited, and any attempt would likely not yield optimal results. Our models were therefore applied using only CCTV image inputs, without incorporating contextual information. Besides, as previously mentioned, there is limited research on integrating CCTV imagery with LLMs. Therefore, the primary aim of this study is to evaluate the feasibility and general performance of this novel application and integration.
Future work may investigate both fine-tuning with more comprehensive localized datasets and the integration of historical flood information as additional input at inference time. At the same time, we acknowledge that fine-tuning requires large, curated datasets and may not always yield accuracy gains proportional to its cost. Therefore, alternative strategies such as prompt engineering, domain adaptation, or few-shot learning will also be considered, as they may provide comparable improvements with lower resource requirements.
To enhance our manuscript regarding this, we have added the following idea to the discussion section, which can be found in lines 547-552:
“Third, in this study, the maximum accuracy achieved was about 85%. Fine-tuning the models for specific regions could be one way to increase reliability, but at the current stage such an effort would be premature, as region-specific datasets are not yet sufficient. For now, fine-tuning would bring little benefit compared to its cost, but it remains an important consideration for future research once more comprehensive local datasets become available.”
Comment_5: About the results, I can understand that you are trying to generate affected-area maps by using “flood” or “no flood” information. But in the conclusion, it is not clear why it is important for us, or how it will affect the end user. I believe that if we could derive early warning signals from this type of detection, it would be much more meaningful.
Response_5: Thank you for this valuable comment. We would like to clarify that the main goal of this study is urban flood monitoring rather than developing a real-time early warning system. The motivation comes from the lack of reliable flood-related information in urban areas, where pluvial floods are highly dynamic and not easily captured by traditional hydrological sensors. By utilizing open-access CCTV images combined with LLM-based flood detection, this approach enables continuous monitoring and provides spatially distributed information on flood occurrences at the city scale.
For end users, this type of monitoring is important because it can:
- Provide supporting evidence for decision makers during flood response and post-event analysis.
- Serve as a complementary data source for calibration and validation of hydraulic or hydrological models.
- Enhance situational awareness in urban areas where conventional flood gauges are limited.
We agree that in the future, the system could be extended into an early warning application. However, this study serves as a proof of concept to demonstrate that LLMs can be used as a lightweight tool for flood monitoring using existing CCTV infrastructure, particularly in data-scarce urban environments.
We realize that the benefits of this information need to be made clearer. Therefore, we have added an explanation in lines 516-524: “The flood occurrence information derived from CCTV classification results provides several practical benefits. First, it enables decision makers to quickly identify areas that are frequently inundated, thereby supporting more effective flood response and resource allocation. Second, the spatial and temporal patterns revealed by this information can serve as valuable input for the calibration and validation of hydrodynamic or flood forecasting models, particularly in data-scarce urban environments. Third, by leveraging existing CCTV infrastructure, the information enhances situational awareness and supports continuous urban flood monitoring without the need for costly new sensor deployments.”
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors

The manuscript entitled “A Novel Multimodal Large Language Model-Based Approach for Urban Flood Detection Using Open-Access Closed Circuit Television in Bandung, Indonesia” presents an interesting approach to urban flood monitoring by leveraging open-access CCTV imagery and multimodal large language models (LLMs). The topic is timely and relevant, addressing an important aspect in urban flood monitoring, particularly in data-scarce environments. The manuscript is generally clear, logically organized, and well-structured.
However, several critical issues require attention before the manuscript can be considered for publication:
- Comparison with Existing Deep Learning Models: The study demonstrates promising results with classification accuracy up to 85% and discusses trade-offs between model performance and cost. However, the experimental design focuses exclusively on LLMs and neglects comparison study with established deep learning image classification models (e.g., MobileNet, VGG, ResNet, EfficientNet). These models are well-documented to achieve accuracies in the 80–90% range even with limited training data. Since the classification task in this study is relatively simple (four classes: daytime flood, nighttime flood, daytime dry, nighttime dry), it remains unclear what advantages LLMs offer over conventional models. A more thorough comparative study is needed to demonstrate the specific added value of LLMs. For example, the authors could investigate under which specific available dataset sizes LLMs outperform conventional models or provide added flexibility. Such a comparison would help establish a clear bridge between this study and existing approaches, as well as clarifying the unique benefits of LLMs for flood monitoring tasks.
- Choice of Classification Classes: The four-class scheme (daytime with flood, nighttime with flood, daytime dry, nighttime dry) raises concerns. The distinction between daytime and nighttime is easily inferred from CCTV metadata and is not particularly valuable for flood model validation. Rather, daytime/nighttime should be treated as a condition affecting classification performance, not as an explicit classification output. More meaningful classes would focus on hydrological relevance, such as a three-class system (dry / wet / flooded), which would provide greater value to end users such as urban flood modellers. The authors are encouraged to reconsider their classification design to ensure that the outputs are both scientifically meaningful and practically useful.
- Methodological Transparency: As the study relies on proprietary LLMs, the reproducibility of results is limited. The manuscript should provide more detail on prompt design, API settings, and evaluation protocols to ensure scientific rigor and allow readers to replicate or extend the work. Moreover, the implications of relying on “black box” systems for critical applications such as flood early warning should be discussed more explicitly, including potential risks related to reliability, interpretability, and accountability.
In summary, this manuscript addresses an important problem and introduces an innovative concept, but its scientific contribution remains unclear without stronger benchmarking against established methods, a more meaningful classification framework, and improved methodological transparency. These revisions are essential to demonstrate the true novelty and robustness of the proposed approach.
Author Response
Reviewer 2
Comment_1: Comparison with Existing Deep Learning Models: The study demonstrates promising results with classification accuracy up to 85% and discusses trade-offs between model performance and cost. However, the experimental design focuses exclusively on LLMs and neglects comparison study with established deep learning image classification models (e.g., MobileNet, VGG, ResNet, EfficientNet). These models are well-documented to achieve accuracies in the 80–90% range even with limited training data. Since the classification task in this study is relatively simple (four classes: daytime flood, nighttime flood, daytime dry, nighttime dry), it remains unclear what advantages LLMs offer over conventional models. A more thorough comparative study is needed to demonstrate the specific added value of LLMs. For example, the authors could investigate under which specific available dataset sizes LLMs outperform conventional models or provide added flexibility. Such a comparison would help establish a clear bridge between this study and existing approaches, as well as clarifying the unique benefits of LLMs for flood monitoring tasks.
Response_1: Thank you for this constructive comment. We agree that comparison with conventional deep learning models is an important point. To address this, we conducted a preliminary study using DeepLab v3+ as a baseline. While DeepLab v3+ performed well in pixel-level segmentation, it frequently misclassified contextual situations, such as detecting rivers or drainage water as road flooding. In contrast, LLMs combine visual perception with semantic reasoning, enabling them to distinguish between standing water on roads and water confined within riverbanks. This contextual reasoning capability addresses a key limitation we observed in conventional segmentation-based approaches.
In terms of cost and efficiency, DeepLab v3+ is indeed computationally lightweight once trained. However, it requires substantial annotated datasets and fine-tuning to adapt to new environments, which reduces scalability and increases cost. LLMs, although more computationally demanding at inference, achieve robust results without retraining or dataset preparation, offering higher generalizability for data-scarce urban contexts such as Bandung.
We acknowledge that with careful fine-tuning and larger annotated datasets, conventional models could achieve higher performance, and we recognize this as an interesting avenue for future work. However, even from this preliminary comparison, the unique advantage of LLMs demonstrated in this study lies in their ability to provide prompt-driven, zero-shot flood monitoring directly from surveillance images, without the overhead of dataset construction and retraining.
Finally, we have added the explanation in the manuscript, which can be found in lines 425-463:
“As part of a preliminary comparison, we used DeepLab v3+, a benchmark deep learning image segmentation model, to assess its applicability for flood detection from CCTV images. The DeepLab v3+ model was trained on the flood image segmentation dataset [37], which was specifically designed to address the challenges of classifying heterogeneous flood-related scenes. This dataset includes five distinct semantic classes: background, dry regions (annotated in red), environment, flood areas (annotated in blue), and wet surfaces (annotated in yellow). By incorporating both hydrological features (flood, wet, dry) and contextual surroundings (environment, background), the dataset enables the model to learn nuanced spatial and spectral differences essential for accurate flood mapping.
While DeepLab v3+ demonstrated strong performance in segmenting water pixels, it frequently produced misclassifications when applied to real-world urban scenes. This difference is further illustrated in Figure 6, where DeepLab v3+ incorrectly identified the water within the drainage channel as flooding and the wet road surface as flooded. The LLM, in this case ChatGPT-4.1, not only classified the image as “no flood” with high confidence (0.90) but also provided a reasoning output: “The water level in the canal is normal and below the bridge, and the road appears dry and clear of water” for Figure 6a and “The road appears wet but not submerged; there is no visible standing water indicating a flood” for Figure 6b. These examples highlight the advantage of LLMs in combining visual recognition with contextual reasoning. Rather than relying solely on pixel-level water detection, the LLM understood the spatial relationship between the canal, the bridge, and the road surface, thereby avoiding a false positive flood classification. Such reasoning capability demonstrates how LLMs can reduce misinterpretations that are common in conventional segmentation-based methods.
It should be noted that models such as DeepLab v3+ are computationally efficient once trained and can achieve high accuracy in well-defined tasks. However, they require large annotated datasets and fine-tuning to generalize effectively across new urban environments. By contrast, LLMs achieve comparable or better robustness without retraining, relying instead on prompt-based zero-shot classification. This trade-off underscores the unique value of LLMs for data-scarce urban settings, where collecting and labeling extensive flood imagery is often impractical. While more comprehensive benchmarking with multiple deep learning models (e.g., ResNet, EfficientNet) could further extend this analysis, our preliminary findings suggest that LLMs provide clear advantages in terms of flexibility, contextual accuracy, and scalability for urban flood monitoring. Of course, the comparison between conventional deep learning models and LLMs in this study is still limited, and a more comprehensive evaluation will be conducted in future work.
(a) (b)
Figure 6. Example of misclassification by DeepLab v3+. Dry regions (annotated in red), environment, flood areas (annotated in blue), and wet surfaces (annotated in yellow).”
Comment_2: Choice of Classification Classes: The four-class scheme (daytime with flood, nighttime with flood, daytime dry, nighttime dry) raises concerns. The distinction between daytime and nighttime is easily inferred from CCTV metadata and is not particularly valuable for flood model validation. Rather, daytime/nighttime should be treated as a condition affecting classification performance, not as an explicit classification output. More meaningful classes would focus on hydrological relevance, such as a three-class system (dry / wet / flooded), which would provide greater value to end users such as urban flood modellers. The authors are encouraged to reconsider their classification design to ensure that the outputs are both scientifically meaningful and practically useful.
Response_2: Thank you for this insightful comment. We agree that, for end users such as flood modellers, a classification scheme based on hydrological relevance (e.g., dry / wet / flooded) would indeed provide more meaningful information. However, the present study was not designed to distinguish water depths or flood severity levels, as the LLMs employed here are not conditioned to quantify depth or differentiate between shallow wet surfaces and actual flooding. Instead, our classification framework was intended to evaluate model robustness under different illumination conditions, since night-time imagery represents a major source of error in CCTV-based flood detection. Therefore, the four-class scheme (daytime flood, nighttime flood, daytime dry, nighttime dry) was adopted to explicitly separate and analyze performance differences between day and night conditions.
We fully acknowledge the reviewer’s point and agree that future work could explore more hydrologically relevant class definitions (e.g., distinguishing wet from flood conditions) by combining LLM-based monitoring with additional datasets such as sensor measurements or numerical modelling outputs. This would provide a richer and more practically useful framework for operational flood monitoring and modelling.
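As an illustration of the point raised here, the day/night condition could in principle be attached from capture-time metadata rather than predicted, leaving the model with only the binary flood decision. This is a hypothetical sketch only: the `four_class_label` helper and the 06:00-18:00 day window are assumptions for illustration, not part of the study.

```python
from datetime import datetime

def four_class_label(flood: bool, captured_at: datetime) -> str:
    """Combine a binary flood decision with capture-time metadata into
    the four-class scheme (assumed 06:00-18:00 day window)."""
    period = "day" if 6 <= captured_at.hour < 18 else "night"
    return f"{'flood' if flood else 'no flood'}-{period}"

# e.g. four_class_label(True, datetime(2024, 3, 1, 22, 15)) -> "flood-night"
```

Under this scheme, day/night becomes a stratification variable for evaluating illumination robustness rather than a label the model must predict.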
Comment_3: Methodological Transparency: As the study relies on proprietary LLMs, the reproducibility of results is limited. The manuscript should provide more detail on prompt design, API settings, and evaluation protocols to ensure scientific rigor and allow readers to replicate or extend the work. Moreover, the implications of relying on “black box” systems for critical applications such as flood early warning should be discussed more explicitly, including potential risks related to reliability, interpretability, and accountability.
Response_3: We thank the reviewer for raising the important point regarding methodological transparency and reproducibility. We would like to clarify that this study does not aim to develop a real-time flood early warning system, but rather an automated flood monitoring system using CCTV footage from multiple locations integrated with LLMs. This distinction is critical, as our objective is not operational forecasting but spatial flood assessment.
To enhance reproducibility and address the reviewer’s concerns, we provide the following methodological details:
- Prompt Design:
We compared two types of prompts for the LLM-based flood detection task:
Simple Prompt: “Is there a flood in the image?”
Complex Prompt: A structured instruction requiring the model to return results in JSON format, including three fields:
- "flood" (boolean) – whether flood is detected,
- "confidence" (float between 0 and 1) – self-assessed certainty,
- "reasoning" (short explanation) – rationale for the decision.
By incorporating the "reasoning" field, the complex prompt introduces a degree of semi-explainability, mitigating the “black-box” nature of LLM predictions. This allows us to qualitatively assess whether the model’s reasoning aligns with observable image features.
- API Settings:
The CCTV images were retrieved using a standard HTTP GET request from publicly available CCTV web IPs. No further API modifications or parameter tuning were applied, ensuring a straightforward and reproducible data acquisition process.
For image analysis, the images were passed as input together with the designated prompt (simple or complex). The API call was executed through the standard chat/completions endpoint. Default temperature and decoding parameters were used (temperature = 0, top-p = 1), to minimize randomness and ensure consistency across runs. The JSON output from the complex prompt was parsed programmatically to extract "flood", "confidence", and "reasoning" fields.
- Evaluation Protocols:
Model performance was evaluated using the F1-score, which balances precision and recall, making it particularly suitable for binary classification tasks such as flood versus no-flood detection. This metric allows for rigorous and interpretable performance comparison between the simple and complex prompts.
- Transparency and Interpretability:
While we acknowledge that the LLM itself remains a proprietary system, our methodological framework enhances transparency at the prompt engineering and evaluation levels. The explicit reasoning component offers interpretability at the decision stage, while the F1-score provides a standardized measure for replicability.
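To make these protocols concrete, the request construction, reply parsing, and F1 computation can be sketched as follows. This is a minimal illustration assuming an OpenAI-style chat/completions payload; the helper names are ours, and exact field names may differ across providers.

```python
import base64
import json

def build_request_payload(image_bytes: bytes, prompt: str, model: str = "gpt-4.1") -> dict:
    """Assemble a chat/completions-style request carrying the CCTV frame
    and the prompt, pinning temperature = 0 and top-p = 1 as in the study."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "temperature": 0,
        "top_p": 1,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

def parse_complex_reply(reply_text: str) -> tuple:
    """Extract the three fields the complex prompt asks the model to return."""
    data = json.loads(reply_text)
    return bool(data["flood"]), float(data["confidence"]), data["reasoning"]

def f1_score_binary(y_true, y_pred):
    """F1-score for binary flood / no-flood labels (1 = flood)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Actually sending the payload requires an HTTP POST with a provider API key; that step is omitted here since it depends on the specific provider.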
We also note that, while most of these protocols were already explained in the manuscript, some detailed explanation was lacking and the protocols were not presented in order, especially in Section 2 regarding the API settings (in particular, the parameters used for the models) and the prompt design. We have now added this clarification to the revised manuscript (Lines 199-250).
“The design of prompts in this study followed a stepwise logical framework. The first consideration was the goal of the task, which is to classify CCTV images into flood or no-flood conditions. At its most basic, this requires only a binary decision. From this reasoning, the simple prompt was formulated, directly asking the model to decide whether flooding is present in the image. This represents the most minimal and efficient approach. However, relying solely on binary answers presents limitations. The results provide no insight into why the model made a certain decision, nor do they indicate the level of certainty behind the classification. In real-world flood monitoring, such information can be valuable for evaluating the reliability of automated detection. To address this gap, the complex prompt was designed, requiring the model to output three components in a structured JSON format: (i) a binary flood classification, (ii) a numerical confidence score between 0 and 1, and (iii) a short reasoning statement.
Therefore, for the simplified prompt, we used the design as follows:
Based on this road CCTV image, analyze if there is a flood or not. Just reply with a YES or NO.
And for the more complex prompt we use:
Based on this road CCTV image, analyze if there is a flood or not. Use this JSON example format:
{
"flood":true,
"confidence":0.95,
"reasoning": "The road is submerged in water, indicating a flood."
}
If there is a flood, set "flood" to true; otherwise, set it to false. The "confidence" should be a float between 0 and 1, indicating your confidence in the prediction. The "reasoning" should be a short explanation of your decision.
For the image analysis, each CCTV frame was submitted to the model together with the designated prompt (either simple or complex). The requests were made through the standard chat/completions endpoint, with decoding parameters set to temperature = 0 and top-p = 1. In LLMs, decoding parameters such as temperature and top-p strongly influence the stability of predictions. The temperature parameter regulates how “random” the model’s outputs are: a high temperature encourages the model to explore less probable options, which can lead to varied answers even for the same image input, while a low temperature reduces variability and makes the model’s responses more consistent. At the extreme, setting temperature to zero makes the model always select the most likely option, ensuring stable predictions across repeated queries. Similarly, the top-p parameter (nucleus sampling) controls how much of the probability distribution is considered when generating outputs. A smaller top-p value restricts the choice to only the most likely tokens, while higher values expand the selection. Setting top-p = 1 means the model considers the full distribution without truncation, allowing the decision-making process to fully rely on the temperature setting.
In the context of this study, we intentionally set temperature = 0 and top-p = 1 across all LLMs. This combination reduces the effect of randomness in classification and ensures that the same image input yields consistent outputs. For flood detection tasks, this stability is important because the goal is to evaluate how well the models can distinguish flooded versus non-flooded scenes, not to assess the diversity of their possible answers. Choosing these values also helps highlight the effect of prompt design rather than being confounded by variability in the decoding process. For example, if temperature were set higher, the same CCTV frame could be labeled differently across repeated runs, introducing uncertainty unrelated to the underlying model capability. By fixing temperature = 0 and top-p = 1, the experimental results directly reflect the strengths and limitations of the models and prompts under evaluation.”
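The behaviour of these two decoding parameters can be mimicked with a toy single-step sampler. This is purely illustrative; actual providers implement decoding internally and differently.

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 1.0,
                      top_p: float = 1.0, rng=random) -> str:
    """Toy decoding step: temperature rescales the logits, top-p truncates
    the candidate set; temperature = 0 collapses to greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)  # deterministic, most likely token
    # softmax with temperature
    scaled = {t: math.exp(v / temperature) for t, v in logits.items()}
    z = sum(scaled.values())
    probs = {t: p / z for t, p in scaled.items()}
    # nucleus (top-p): keep the smallest high-probability set with mass >= top_p
    kept, mass = [], 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # sample from the kept set, renormalized
    r = rng.random() * sum(p for _, p in kept)
    acc = 0.0
    for t, p in kept:
        acc += p
        if acc >= r:
            return t
    return kept[-1][0]
```

Note that with temperature = 0 the greedy branch is taken before any sampling, so top-p has no effect, consistent with the settings used in the study.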
Round 2
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for AuthorsThe manuscript entitled “A Novel Multimodal Large Language Model-Based Approach for Urban Flood Detection Using Open-Access Closed Circuit Television in Bandung, Indonesia” is a revision of the version I previously reviewed. In my earlier review, I raised three concerns: (1) the lack of comparison with existing deep learning models, (2) the choice of classification classes, and (3) methodological transparency.
In the revised submission, I acknowledge that concerns (2) and (3) have been adequately addressed. However, the primary concern regarding (1) remains unresolved. The authors chose Deeplab v3+ as the sole baseline for comparison. This choice is problematic because Deeplab v3+ is a pixel-level segmentation model, while the proposed LLM-based method performs a four-class classification task. This is essentially an “apples-to-oranges” comparison and does not provide a fair or meaningful benchmark.
A more thorough and consistent comparative study is still necessary to demonstrate the specific added value of LLMs for this task. In particular, the authors should investigate under which conditions LLMs outperform conventional deep learning approaches. For example, how do performance trends vary when training with different dataset sizes (e.g., 100 labeled images vs. 1,000 labeled images vs. 5,000 labeled images)? Clear experimental evidence is needed to quantify “which model performs better under which conditions.” Without this, the manuscript’s central claim of LLMs offering distinct advantages over existing methods remains insufficiently supported.
Author Response
Comment_1: The manuscript entitled “A Novel Multimodal Large Language Model-Based Approach for Urban Flood Detection Using Open-Access Closed Circuit Television in Bandung, Indonesia” is a revision of the version I previously reviewed. In my earlier review, I raised three concerns: (1) the lack of comparison with existing deep learning models, (2) the choice of classification classes, and (3) methodological transparency.
Response_1: We thank the reviewer for the thoughtful feedback and we fully acknowledge that the comparison with DeepLab v3+ presents limitations, as it is a segmentation model while our proposed LLM-based method performs frame-level classification. This point is well taken, and we agree that the current benchmarking does not fully capture the range of conventional deep learning approaches that are directly tailored for classification.
At the same time, we would like to emphasize that the main goal of this study is not to provide a final benchmarking against all existing deep learning models, but rather to explore the feasibility of using large language models (LLMs) for urban flood detection. Unlike conventional computer vision models, LLMs bring a different perspective: they combine visual features with contextual reasoning, offering a novel paradigm for interpreting CCTV imagery.
Accordingly, our comparative framework focused on multiple LLM families and configurations, including ChatGPT, Gemini, DeepSeek, and Mistral, each tested in different versions (from full-scale models to lighter mini variants) and under varied prompting strategies. This design allowed us to observe how performance differs not only across architectures but also across model scales and prompt designs.
The results show that LLMs can indeed be applied to urban flood detection, but with important challenges. Performance is highly sensitive to prompt formulation, varies significantly across model families, and shows trade-offs between accuracy and efficiency when moving from full to mini versions. These findings underscore both the potential and the limitations of LLMs.
We agree that future work should include a broader and more systematic comparison with classification-based deep learning models (e.g., ResNet, EfficientNet, Vision Transformers) and should investigate performance trends across varying dataset sizes, as suggested by the reviewer. However, given the scope of the present paper, we frame our contribution as a first step in demonstrating the possibility of leveraging LLMs for this domain, rather than a final benchmarking study.
Author Response File:
Author Response.pdf
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear authors, please post your responses to comments in the second column of the table below. Use the attached file with the table.
The right column should contain information about the changes, how and where in the text this is taken into account.
Or, please, justify your disagreement.
1) Using abbreviations in the title of an article is very bad style. The text should contain an explanation of abbreviations when they first appear.
2) References to literary sources are formatted within square brackets (for example, Lines 40, 45, etc.).
3) The quality of Figure 1 needs to be improved. How is the topographic data presented in the figure used? It would seem that the elevation data should play a key role in hydrology and flood patterns.
4) What is the scientific novelty of the manuscript? What new scientific method is proposed? The prompt for NN on page 9 looks like the main result. Are there any other scientific results?
5) The quality of Figure 2 is unsatisfactory. What is the number in parentheses?
6) There is no reference to Figure 3 and, accordingly, there is no description of this figure.
7) Why do the authors provide pseudocode in Figure 4? There is no discussion of this figure in the main part of the text. The essence of this pseudocode is that the images from the cameras are saved periodically. What is new here and what new method of data processing is proposed?
8) The authors form a prompt and base their queries on the neural network such as “Does this image show signs of flooding?” or “Is the road flooded?” What new scientific method is proposed? What is the scientific result in this case?
9) All the images in Figure 5 show the situation in the city, where we see a large number of people and cars. Everyone already knows about this flooding and walks on the water. Even at night, there are services (police, utilities) working around the clock in the city that see this water. It is unclear what kind of warning the authors are talking about in such situations.
10) Commas must be after formulas (1), (2) and (3). Description of quantities in Lines 192-195 must be in academic style in accordance with the rules of writing English sentences.
Comments for author File:
Comments.pdf
Author Response
Comment_1 : Using abbreviations in the title of an article is very bad style. The text should contain an explanation of abbreviations when they first appear.
Response_1 : Thank you for your valuable suggestions. We fully agree with this point and have revised the title into “A Novel Large Language Model-Based Approach for Urban Flood Detection Using Open-Access Closed Circuit Television in Bandung, Indonesia”.
Comment_2 : References to literary sources are formatted within square brackets (for example, Lines 40,45,etc)
Response_2 : Thank you for the observations. We agree with the reviewer’s comment, and all in-text citations have been revised to consistently follow the square bracket format as required by the journal’s referencing system.
Comment_3 : The quality of Figure 1 needs to be improved. How is the topographic data presented in the figure used? It would seem that the elevation data play a key role in hydrology and flood patterns.
Response_3 : Thank you for this constructive feedback. We agree that the original figure required improvement, and we have replaced Figure 1 with a higher-resolution version to enhance its visual clarity. Regarding the use of topographic data, the figure serves primarily to provide geographic and contextual background for readers, illustrating the location and general terrain of Bandung City. While the elevation data is not directly used in the flood detection analysis, it helps convey the rationale for focusing on Bandung as a flood-prone urban area. This context is important, as the city's bowl-shaped topography contributes significantly to its vulnerability to flooding. This rationale has already been explained in the original manuscript (Lines 106–112), which states: “This study was conducted in Bandung, the capital city of West Java Province, Indonesia. Geographically, Bandung is located within a tectonic basin surrounded by volcanic highlands, creating a topographical structure that resembles a large bowl or basin. This natural “bowl-shaped” formation causes water from the surrounding uplands to flow and accumulate toward the city center during periods of heavy rainfall. As a result, Bandung is highly susceptible to urban flooding, particularly in low-lying areas where drainage is inadequate." To improve clarity, we have also adjusted the figure caption and emphasized this point in the revised text, which can be seen in the revised manuscript (Lines 119–122): “Figure 1. Topographic map and administrative boundary of Bandung City. The city lies within a tectonic basin surrounded by volcanic highlands, forming a bowl-shaped terrain. This unique topography causes water from surrounding uplands to flow toward the city center, increasing the risk of flooding in low-lying urban areas.”
Comment_4 : What is the scientific novelty of the manuscript? What new scientific method is proposed? The prompt for NN on page 9 looks like the main result. Are there any other scientific results?
Response_4 : Thank you for raising this important point. We would like to clarify that the abbreviation “NN” does not appear in the manuscript, and we assume the reviewer is referring to the prompt-based approach using Large Language Models (LLMs) as described on page 9. The scientific novelty of our manuscript lies in the application of LLMs for image-based flood detection using publicly available urban CCTV feeds, which, based on our literature review, has not been explored in previous studies. Most prior works have focused on fluvial flood detection using physical sensors or traditional CNN-based vision models. In contrast, our work leverages the emerging capabilities of multimodal LLMs through a lightweight, prompt-based interface without the need for training or fine-tuning. This has been explicitly mentioned in the Introduction (Lines 98–103): “This study aims to develop a new flood monitoring framework for urban areas by combining open-access CCTV data with the visual recognition capabilities of LLMs. By using a lightweight, prompt-based approach, the proposed method offers an efficient and scalable way to detect flood events. It also supports real-time surveillance and early warning systems, especially in areas where conventional sensor-based data may be limited or unavailable.” Furthermore, the study presents additional scientific results beyond the prompt itself, including:
- Comparative evaluation of four LLMs (GPT-4.1, Gemini, Mistral, DeepSeek) based on F1 score, speed, and cost (Section 3.1)
- Analysis of GPT-4.1 variants (nano, mini, full) under different lighting/image settings (day and night) (Section 3.2)
- Evaluation of prompt design and its effect on accuracy and cost (Section 3.3)
- Performance under standard and hard visual cases to assess robustness across varying image quality (Section 3.4)
We hope this clarification helps explain the novelty and contributions of our work.
Comment_5 : The quality of Figure 2 is unsatisfactory. What is the number in parentheses?
Response_5 : Thank you for your comment. Figure 2 was captured from the official public CCTV platform of the Bandung City Government, and we have made our best effort to present the image clearly and faithfully based on the quality available on the website. As this is a real-time monitoring interface, the resolution is limited by the platform itself; therefore, the current version is the best quality obtainable. Regarding the number in parentheses, we assume the reviewer is referring to the citation (23). This is a reference to the website source [23], as listed in the reference section, and is not part of the figure label or content. To avoid confusion, we have revised the Figure 2 caption to “Figure 2. User interface of the official Bandung City CCTV monitoring platform [23]”. The revised text can be seen in the revised manuscript (Line 141).
Comment_6 : There is no reference to Figure 3 and, accordingly, there is no description of this figure.
Response_6 : Thank you for your comment. We would like to clarify that Figure 3 presents the workflow of the proposed system, and its detailed description is provided in Lines 142-147 of the original manuscript. However, we acknowledge that the figure was not explicitly referred to by name in the text. To address this, we have added a clear reference to Figure 3 in the revised manuscript shown in Line 149 “In general, the overall workflow of this model can be seen in the Figure 3.”
Comment_7 : Why do the authors provide pseudocode in Figure 4? There is no discussion of this figure in the main part of the text. The essence of this pseudocode is that the images from the cameras are saved periodically. What is new here and what new method of data processing is proposed?
Response_7 : Thank you for your comment and the opportunity to clarify. Figure 4 presents the pseudocode for our custom data collection system, which was developed to automatically capture and organize images from over 340 CCTV feeds every 15 minutes for 365 days. While the operation may appear simple at a glance, the implementation had to address several challenges, such as asynchronous feed availability, variable camera response, and data storage management at scale. Although the underlying logic is straightforward, we believe the inclusion of the pseudocode adds value by enhancing reproducibility for future researchers who may wish to replicate large-scale image collection in similar public CCTV environments. The discussion of Figure 4 has been explicitly mentioned in the original manuscript in Lines 152–160.
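For readers, the periodic capture logic described in this response can be illustrated with a minimal Python sketch. This is not the authors' actual pseudocode from Figure 4; the function names, the path layout, and the `fetch_frame` stub are hypothetical stand-ins.

```python
import datetime
import pathlib

def fetch_frame(camera_id: str) -> bytes:
    """Hypothetical stand-in: the real system would issue an HTTP
    request to the public feed and return raw JPEG bytes."""
    raise NotImplementedError("replace with a request to the CCTV feed")

def snapshot_path(root: str, camera_id: str,
                  ts: datetime.datetime) -> pathlib.Path:
    """Build a path like root/<camera_id>/2024-01-31/2024-01-31T08-15.jpg
    so that a year of 15-minute snapshots stays organized per camera/day."""
    day = ts.strftime("%Y-%m-%d")
    name = ts.strftime("%Y-%m-%dT%H-%M") + ".jpg"
    return pathlib.Path(root) / camera_id / day / name

def capture_cycle(camera_ids, root="frames"):
    """One capture pass over all cameras; unavailable feeds are
    skipped rather than aborting the whole cycle."""
    ts = datetime.datetime.now()
    for cam in camera_ids:
        try:
            frame = fetch_frame(cam)
        except Exception:
            continue  # tolerate asynchronous feed outages
        path = snapshot_path(root, cam, ts)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(frame)
```

A scheduler (cron or a sleep loop) would invoke `capture_cycle` every 15 minutes; the per-camera `try/except` reflects the feed-availability challenge the response mentions.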
Comment_8 : The authors form a prompt and base their queries on the neural network such as “Does this image show signs of flooding?” or “Is the road flooded?” What new scientific method is proposed? What is the scientific result in this case?
Response_8 : Thank you for this insightful comment. We would like to clarify that the prompts such as “Does this image show signs of flooding?” or “Is the road flooded?” were mentioned in the manuscript only as illustrative examples (Section 2.3, Lines 178–189) to help readers understand the concept of prompt-based queries. In the actual study, we designed two distinct types of prompts to investigate how prompt complexity may affect LLM performance: a detailed prompt (Original Manuscript Lines 310–319) and a simplified prompt (Original Manuscript Lines 322–323). This design choice was intentional and forms part of the scientific contribution of our study. The scientific result in this context is that prompt variation leads to noticeable differences in model accuracy, while having only minimal impact on cost and processing speed. As shown in Table 4 and discussed in Section 3.3, the detailed prompt improves classification accuracy by approximately 10% compared to the simple prompt, especially in smaller models such as GPT-4.1-nano. This suggests that richer prompt structures can help LLMs produce more accurate visual classifications without incurring substantial overhead. Furthermore, we would like to emphasize that this study does not aim to develop a new neural network architecture. Instead, the novelty lies in demonstrating the potential of Large Language Models (LLMs) as a lightweight, prompt-driven method for flood image recognition, particularly in urban environments where data availability and infrastructure are limited.
Comment_9 : All the images in Figure 5 show the situation in the city, where we see a large number of people and cars. Everyone already knows about this flooding and walks on the water. Even at night, there are services (police, utilities) working around the clock in the city that see this water. It is unclear what kind of warning the authors are talking about in such situations.
Response_9 : Thank you for this thoughtful comment. We would like to clarify that Figure 5 was not intended to depict emergency response scenarios or real-time early warning applications, but rather to show representative examples of CCTV imagery collected during the study. The images were obtained from 340 camera locations over 365 days and were categorized into four conditions: flood-day, no flood-day, flood-night, and no flood-night. These categories served as the foundation for the classification and benchmarking of LLM performance. As mentioned in the Introduction section (Original Manuscript Lines 71–85), one of the key motivations for this study is the lack of consistent flood observation data in urban areas, which contrasts with the relatively well-instrumented nature of fluvial flood monitoring. In cities, flood sensors are difficult to install due to complex infrastructure, vandalism risks, and limited line-of-sight. Consequently, CCTV is one of the few consistent and accessible sources of visual flood evidence, and this study proposes a lightweight method to automatically utilize such data. Moreover, the objective of this work is not to create a real-time flood early warning system, but rather to support urban flood monitoring and situational awareness through existing visual data.
Comment_10 : Commas must be after formulas (1), (2), and (3). Description of quantities in Lines 192-195 must be in academic style in accordance with the rules of writing English sentences
Response_10 : Thank you for your suggestions. We agree with both points and have revised the manuscript accordingly. Commas have been added after references to formulas where appropriate. Additionally, the descriptions of the variables have been rewritten in academic sentence form to ensure clarity and consistency; see the revised manuscript, Lines 212–215: “Where TP (True Positive) denotes the number of flood images that are correctly identified as flood, FP (False Positive) denotes the number of non-flood images that are incorrectly classified as flood, and FN (False Negative) denotes the number of flood images that are incorrectly classified as non-flood.”
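The TP/FP/FN definitions quoted in this response translate directly into the evaluation metrics used in the study. The sketch below is an illustration added for clarity, not code from the manuscript.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from flood-detection counts:
    tp = flood images correctly flagged as flood,
    fp = non-flood images incorrectly flagged as flood,
    fn = flood images incorrectly flagged as non-flood."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

For example, 80 correctly detected floods with 20 false alarms and 20 misses yield precision, recall, and F1 of 0.8 each.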
Reviewer 2 Report
Comments and Suggestions for Authors
To detect urban pluvial floods in real time is an important topic for flood management, as it enables flood warning and fast response. This manuscript analyses CCTV-captured images using four popular LLM models and provides valuable insights for the application of LLMs in flood hazard detection. The manuscript is well organised and deserves publication after the following questions have been addressed.
- In section 1, the introduction lacks relevant references to flood detection techniques.
- In section 2.3, the manuscript did not reason why LLM was chosen instead of traditional object detection methods like YOLO. YOLO might perform better in flood detection tasks at a much cheaper cost. It is thus worthwhile to compare LLM and traditional object detection methods like YOLO in flood detection.
- In section 2, the manuscript lacks a detailed description of object detection algorithms and techniques used by LLMs.
Author Response
Comment_1 : In section 1, the introduction lacks relevant references to flood detection techniques.
Response_1 : Thank you for the valuable feedback. We agree with the reviewer’s comment. In response, we have revised Section 1 of the manuscript by incorporating several additional and relevant references related to flood detection techniques to strengthen the context and background of the study that can be seen in revised manuscript Lines 74 – 76.
“In general, CCTV image or video data can be used alongside traditional image recognition techniques such as image/video segmentations [23–26] and object detections [27–30], to identify the presence of flooding within the captured scenes.”
Comment_2 : In section 2.3, the manuscript did not reason why LLM was chosen instead of traditional object detection methods like YOLO. YOLO might perform better in flood detection tasks at a much cheaper cost. It is thus worthwhile to compare LLM and traditional object detection methods like YOLO in flood detection.
Response_2 : Thank you for this valuable comment. We acknowledge that traditional object detection models such as YOLO have been widely and successfully used in flood detection tasks. However, the aim of this study was not to replace or outperform dedicated object detection methods, but rather to explore the potential of Large Language Models (LLMs) as a lightweight, flexible alternative for flood classification tasks—particularly in situations where pre-trained object detection models may not generalize well, or where labeled datasets and model fine-tuning are not available.
LLMs with visual understanding capabilities offer a zero-shot or few-shot classification approach, which is particularly useful in urban flood scenarios where the flood appearance can vary significantly across different camera views and lighting conditions. Unlike YOLO, LLMs can work directly with natural language prompts, removing the need to define explicit object classes or retrain models.
Furthermore, this study focuses on evaluating prompt engineering, model scalability, cost, and speed trade-offs, which are not directly comparable with traditional CNN-based models. That said, we agree that a direct comparison between LLM-based methods and traditional computer vision models like YOLO is valuable, especially from a performance and cost-efficiency standpoint. We will consider adding such a comparison in future work, possibly benchmarking both approaches on the same dataset to assess trade-offs in accuracy, speed, and generalization capabilities.
Finally, we have added the reasoning for choosing LLMs instead of a traditional object detection model in the revised manuscript (Lines 191–205): “While traditional object detection models have proven effective for visual flood detection tasks, the decision to use LLMs in this study is motivated by several key advantages. First, multimodal LLMs support zero-shot classification using natural language prompts, which eliminates the need for training labeled datasets or fine-tuning. Second, the urban flood scenes captured by city wide CCTV cameras exhibit significant variability in scale, viewing angle, and object occlusion conditions under which bounding-box-based object detectors may require extensive retraining and adaptation. Third, LLMs provide high flexibility in design through prompt engineering. Additional tasks—such as accident detection or traffic congestion analysis—can be incorporated simply by extending the output prompt (e.g., by adding keys like "is_there_accident" or "is_traffic_jam" to a JSON response), without any changes to the model weights or structure. Traditional object detection pipelines, by comparison, would require new datasets, retraining, and re-optimization for each added task. Thus, although object detection methods may be more efficient for well-defined tasks with consistent visual features, this study focuses on evaluating LLMs as a lightweight and adaptive alternative for scalable urban flood monitoring.”
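The JSON-key extensibility described in the quoted passage can be sketched in a few lines. The prompt wording and the base key `is_flood` below are hypothetical placeholders; `is_there_accident` follows the authors' own example of extending the schema without retraining.

```python
import json

# Base schema; real key names and descriptions are assumptions,
# not the authors' actual prompt.
BASE_KEYS = {"is_flood": "true if standing water covers the road"}

def build_prompt(extra_keys=None) -> str:
    """Compose a prompt that asks the model to answer with strict JSON,
    so new tasks are added by extending the key set, not the model."""
    keys = dict(BASE_KEYS)
    if extra_keys:
        keys.update(extra_keys)
    fields = ", ".join(f'"{k}": <bool> ({v})' for k, v in keys.items())
    return ("Look at this CCTV image and reply with JSON only, "
            f"using exactly these fields: {{{fields}}}")

def parse_reply(reply: str) -> dict:
    """Parse the model's JSON reply; raises ValueError on invalid JSON."""
    return json.loads(reply)
```

Adding a task is then a one-line change, e.g. `build_prompt({"is_there_accident": "true if a collision is visible"})`, with no retraining involved.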
Comment_3 : In section 2, the manuscript lacks a detailed description of object detection algorithms and techniques used by LLMs.
Response_3 : Thank you for your constructive comment. We acknowledge that the internal object detection algorithms and techniques used by LLMs are not described in detail in the original manuscript. This is primarily because the specific architectures and internal mechanisms of the LLMs used in this study (e.g., GPT-4.1, Gemini, Mistral, DeepSeek) are proprietary and not publicly available. Nevertheless, to address this concern, we have added a paragraph in the revised manuscript (Lines 259–271) that explains how multimodal LLMs generally process visual inputs for flood detection.
“The LLMs employed in this study are multimodal models, capable of jointly processing visual and textual inputs. Upon receiving a CCTV image, these models utilize an internal vision encoder—typically based on transformer or convolutional architectures—to extract spatial and object-level features from the image. These features are subsequently integrated into the language reasoning component of the model, which interprets them in conjunction with prior learned knowledge, including common flood indicators and contextual visual patterns associated with urban street environments. It is important to emphasize that the specific internal architectures and training mechanisms of these models are not publicly disclosed, as they are proprietary. Consequently, this study does not seek to analyze the models’ internal structures in detail. Rather, the focus lies in evaluating their practical performance for flood detection using visual data, particularly in terms of accuracy, computational cost, and responsiveness under different prompt formulations and image conditions.”
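As a concrete illustration of how such a multimodal request might be assembled, the sketch below builds a text-plus-image chat message in the shape used by OpenAI-style vision endpoints. This payload format is an assumption for illustration (other providers use different formats), and it is not the authors' implementation.

```python
import base64

def image_message(prompt_text: str, jpeg_bytes: bytes) -> list:
    """Pair a text prompt with an inline base64-encoded CCTV frame,
    following the OpenAI-style chat message shape (assumed format)."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]
```

This message list would then be sent to the provider's chat endpoint; the vision encoder and language components described in the quoted paragraph operate entirely server-side.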
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The author_response.docx file is empty and does not contain answers.
1) Line 166. The font `Figure 4` should not be bold. This applies to all similar places in the entire manuscript.
2) Lines 189-192. The sentence contains a hyphen. What role does it play? This is not English. It is highly advisable to edit the text, removing the hyphens (which possibly act as a verb?). This applies to similar fragments throughout the text.
3) Lines 216 - 224. The use of Internet formatting is unacceptable.
The phrase `The F1 score could be defined as:` should be replaced with `We calculate the following characteristics:`.
Commas should come after formulas (1), (2) and (3).
The word `Where` is a continuation of the sentence and should be in lower case. Lines 218-224 are one sentence and should be formatted accordingly. The description of quantities in equations is traditionally formatted as: `where a is the …, b is the …., c is the … .`.
You write in Response_10 that you have corrected the remark. However, the second version does not contain the edits.
In addition, Response_10 contains a mention of edits in Lines 212-215. However, I do not see these changes in the second version of the manuscript.
In your Response_8, you confirm that the main tool of this study and the main scientific result is the prompt on Line 351 and below. It is unclear what simplified prompt (Original Manuscript Lines 322–323) you are talking about. These lines do not contain a prompt.
I find it difficult to agree that a GPT prompt for detecting water in an image can be a scientific result.
Comments for author File:
Comments.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Thanks to the authors for addressing the questions raised.
